A structured glossary of data buzzwords

“Big data”, “data science”, “machine learning”: data is surrounded by many buzzwords, some of which are often used interchangeably. This can quickly become confusing.

Here we present a glossary that explains essential data concepts in simple words. Related buzzwords are grouped together to help you see the structure in the list. Reading this glossary will improve your understanding of what data entails and what it can do for your company.


Data: Information measured through observation.

Big data: Big data isn’t just any big dataset, but can be loosely defined through the ‘five V’s’ of big data: 1) volume, the size of the dataset, 2) velocity, the speed at which new data is being collected and needs to be analyzed, 3) variety, the diversity of data types, 4) veracity, the quality and accuracy of the data, and 5) value, the added value of analyzing all the data.

Dataset: A standardized set of measurements. Usually consists of a table, with observations in the rows, and measured properties in the columns. For example: a table where each row represents a customer transaction on a webshop, with columns representing data about the transaction (who bought what).
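
To make this concrete, here is a minimal sketch of such a table using the pandas library (all column names and values are made up for illustration):

```python
import pandas as pd

# A tiny transactions dataset: one row per observation (a purchase),
# one column per measured property. Names and values are hypothetical.
transactions = pd.DataFrame({
    "customer_id": [101, 102, 101],
    "product": ["shoes", "jacket", "socks"],
    "price_eur": [59.95, 89.00, 7.50],
})
print(transactions)
```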

Variable: Data that is defined and measured in a particular way. Also called a column or feature of a dataset.


Data analytics: Analyzing data using statistics in order to describe existing data. Also called descriptive analytics.

A/B testing: Hypothesis testing by comparing two (or more) scenarios using statistics. For example: offering half of the users website layout #1 and the other half website layout #2, and comparing their behavior. Also called confirmatory data analysis.
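
A minimal sketch of how such a comparison could be tested, assuming hypothetical conversion counts per layout and using scipy's chi-square test (one of several tests that could be used here):

```python
from scipy.stats import chi2_contingency

# Hypothetical results: [converted, not converted] per layout.
layout_1 = [120, 880]  # 1,000 users saw layout #1
layout_2 = [150, 850]  # 1,000 users saw layout #2

# The chi-square test checks whether conversion differs between layouts.
chi2, p_value, dof, expected = chi2_contingency([layout_1, layout_2])
print(f"p-value: {p_value:.4f}")  # a small p-value suggests a real difference
```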

Business intelligence (BI): Using data analytics to inform business strategy.

Dashboard: A collection of plots and summary numbers, often interactive. Can be used to describe the current status of a company, for example by showing a graph of the profit made each month over the past years.

Exploratory data analysis: Exploring data using descriptive analytics to get an overview of the type and contents of each variable, their relationships with other variables, and data quality.
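
A minimal sketch of what this could look like with pandas (the file name transactions.csv is hypothetical):

```python
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical input file

df.info()                          # type and non-null count of each variable
print(df.describe())               # summary statistics per numeric column
print(df.isna().sum())             # missing values per column (data quality)
print(df.corr(numeric_only=True))  # pairwise correlations between variables
```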

Statistics: Methods for analyzing data based on probability theory and the (assumed) distribution of the data. Related to machine learning.

Correlation: Describes the way that two variables are associated. A positive correlation means that two variables change in the same direction: when A is bigger, B is also bigger. A negative correlation means that two variables change in the opposite direction: when A is bigger, B is smaller.
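
A small illustration using numpy (the numbers are made up):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b_pos = np.array([2, 4, 5, 8, 10])  # grows when a grows
b_neg = np.array([10, 8, 5, 4, 2])  # shrinks when a grows

print(np.corrcoef(a, b_pos)[0, 1])  # close to +1: positive correlation
print(np.corrcoef(a, b_neg)[0, 1])  # close to -1: negative correlation
```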


Data engineering: Preparing data so that it can be used for data analytics and data science.

Database: A storage space for all kinds of datasets. Can also be relational, where different datasets are linked to each other through shared keys.

Data hub: A data hub is a central storage for structured data.

Data lake: A data lake stores all produced data, from raw data such as files to transformed data such as tables. A very messy data lake (especially one containing big data) is also called a data swamp, indicating that it has become unusable.

Data warehouse: A central store that combines multiple databases, containing for example raw data, metadata (data about data), and summary data.

Data pipeline: A data pipeline is like an assembly line for data, where ETL processes are automated.

ETL: ‘Extract, Transform and Load’ refers to the process of extracting data from one or more sources, transforming it into a usable format, and loading the transformed data into a database.
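
A minimal ETL sketch in Python, using pandas and the built-in sqlite3 module (the file, table, and column names are all hypothetical):

```python
import sqlite3
import pandas as pd

# Extract: read raw data from a source.
raw = pd.read_csv("raw_orders.csv")

# Transform: clean the data into a usable format.
raw["order_date"] = pd.to_datetime(raw["order_date"])
cleaned = raw.dropna(subset=["customer_id"])

# Load: write the transformed data into a database.
with sqlite3.connect("warehouse.db") as conn:
    cleaned.to_sql("orders", conn, if_exists="replace", index=False)
```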


Data science: Analyzing data using machine learning or artificial intelligence in order to find patterns and predict new data. Involves training an algorithm on (big) data, where it learns which features are important (and their level of importance) in order to best predict new data.

Algorithm: A set of instructions given to a computer, designed to achieve a certain task. Often based on mathematical models.

Artificial Intelligence (AI): The use of machine-learning algorithms to let machines complete ‘intelligent’ tasks, such as: reasoning, knowledge representation, planning, learning, processing language, perception, and the ability to move and manipulate objects.

Natural language processing (NLP): The automated analysis of written or spoken language. For example: the automated translation between two languages. Related concepts: ‘speech recognition’, ‘natural language understanding’, and ‘natural language generation’.

Data mining: A true buzzword, this fuzzy term is often used to refer to data analytics, data science, machine learning, or artificial intelligence. Its real meaning is using machine learning and statistics to uncover patterns in big data.

Feature engineering: After exploratory data analysis, the features (variables) can be combined or transformed in creative ways, in order to create new variables that are better suited to the goal. For example, a variable containing dates can be transformed to indicate whether it was a weekday or a weekend, as sketched below.
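
A minimal sketch of that date example using pandas:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2020-02-14", "2020-02-15", "2020-02-17"])})

# New feature derived from the raw date: weekday (False) or weekend (True).
df["is_weekend"] = df["date"].dt.dayofweek >= 5  # 5 = Saturday, 6 = Sunday
print(df)
```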

Feature selection: After feature engineering, the dataset contains many variables, which can cause all kinds of problems when used in algorithms (such as reduced interpretability or overfitting). Therefore, feature selection is used to carefully choose the combination of most important features that will be used to predict new data. Machine learning algorithms can also be used to help reduce the number of features.
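
A minimal sketch using scikit-learn's SelectKBest on a built-in example dataset (one of many possible selection methods):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 observations, 4 features

# Keep only the 2 features most strongly associated with the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (150, 2)
```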

Machine learning: An algorithm executed by a computer where the exact instructions are not programmed in, but are instead learned by analyzing the data, with the goal of gaining knowledge. Also called predictive analytics.

Classification problem: The use of supervised machine learning to predict categorical data. Examples of classification algorithms are logistic regression, decision trees, and k-nearest neighbors.
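
A minimal classification sketch with scikit-learn, using logistic regression on a built-in example dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on labelled examples, then predict categories for unseen data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # fraction of correct predictions
```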

Neural network: A machine learning method inspired by the structure of biological neural networks, trained for a specific task. Typically used to solve artificial intelligence problems. Also called artificial neural networks.

Deep learning: A form of neural network with at least two hidden layers of neurons, giving it the capacity to learn more complicated machine learning tasks.
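
A minimal sketch of a small network with two hidden layers, using scikit-learn's MLPClassifier on a built-in dataset (the layer sizes are chosen arbitrarily for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers of neurons make this a (small) deep network
# by the definition above.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))
```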

Regression problem: The use of supervised machine learning to predict continuous numerical data. Examples of regression algorithms are linear regression and support vector machines.
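
A minimal regression sketch using scikit-learn's linear regression (the data is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: advertising spend (x) vs. resulting revenue (y).
spend = np.array([[1.0], [2.0], [3.0], [4.0]])
revenue = np.array([2.1, 3.9, 6.2, 7.8])

model = LinearRegression().fit(spend, revenue)
print(model.predict([[5.0]]))  # predicts a continuous number for new data
```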

Supervised learning: Machine learning algorithms and methods where the expected result has been clearly defined and measured, so that this can be used as examples to help train the algorithm.

Unsupervised learning: Machine learning algorithms and methods where the expected result is not explicitly defined. Can be further divided into, for example, clustering, association analysis, and dimensionality reduction (such as principal component analysis).
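
A minimal clustering sketch using scikit-learn's k-means on a built-in dataset, with the labels deliberately ignored:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # labels are ignored: unsupervised

# Group observations into 3 clusters without any predefined answers.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # cluster assignment for the first 10 observations
```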


Other

Blockchain: A distributed electronic ledger: a way to spread data across multiple locations without a central administrator, used to securely record transactions. Mainly used for cryptocurrencies such as Bitcoin.

Internet of things: Devices with embedded wireless internet transmission, such as cars, smartwatches, or smartphones, that send their user data to an internet-of-things platform. There, the data is collected and analyzed, to be further used by the manufacturer or user.



Eefje Poppelaars started working as a data scientist at NISI in February 2020. She has a background in scientific research, completing a PhD in neuroscience and psychology at Salzburg University (Austria), as well as a Bachelor and Research Master with honours at Leiden University.

As a curious and investigative Data Scientist, she is passionate about finding out what is going on in the data and thinking of creative ways to tackle problems. She is also quick to learn new skills and is continuously developing herself. Eefje is driven to develop data science solutions to help companies get the most out of their data.

A week before the coronavirus situation urged everyone to work from home, she started working for PostNL at the Data Solutions department. There, she helped develop their 'Adres in Beeld' API, which helps webshops gain insights into addresses in order to optimize their marketing and sales. This challenging project involved aspects of data engineering, data analytics, and data science, and enabled her to use a diverse skill set.