Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
The goal of this case study is to analyze smart device usage data to unlock new growth opportunities for Bellabeat.
Question: How can a wellness technology company play it smart?
Answer: Identify and analyze consumer usage habits from the provided fitness data to determine actionable insights that can inform Bellabeat’s marketing strategy and drive product development.
Business Task: Analyze multi-dimensional smart device data to identify gaps in consumer wellness habits. Use these insights to position Bellabeat as a holistic lifestyle brand (active + restorative) rather than just a fitness tracker.
I use public data that explores smart device users’ daily habits, available on Kaggle: FitBit Fitness Tracker Data.
This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
The data is from a public third-party fitness tracker, not Bellabeat’s users, which introduces potential bias and limits generalizability to Bellabeat’s specific, female-focused target audience.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidyr)
library(readr)
library(dplyr)
library(lubridate)
| Dataset Group | Files Selected | Purpose |
|---|---|---|
| A. Macro Habits | dailyActivity_merged.csv | The master table for steps, distance, and calories. |
| B. Micro Habits | hourlySteps.csv, hourlyCalories.csv, hourlyIntensities.csv | Critical for identifying when users are active (Time-of-Day analysis). |
| C. Wellness & Health | sleepDay_merged.csv, heartrate_seconds.csv | To analyze recovery, stress, and sleep quality. |
| D. User Engagement | weightLogInfo_merged.csv | To analyze feature adoption rates (manual vs. auto-logging). |
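All of the selected CSVs live in the same export folder, so the repeated long paths in the imports below can optionally be collapsed into a small helper. This is a sketch: `data_dir` and `read_fitabase` are hypothetical names, and the path must be adjusted to your local copy of the Fitabase export.

```r
# Base directory of the unzipped Fitabase export (adjust to your machine).
data_dir <- "Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16"

# Read any Fitabase CSV by file name, quieting the column-type message.
read_fitabase <- function(file) {
  readr::read_csv(file.path(data_dir, file), show_col_types = FALSE)
}

# Example:
# dailyActivity_merged <- read_fitabase("dailyActivity_merged.csv")
```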
# Import the "dailyActivity_merged" dataset.
dailyActivity_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "hourlyCalories_merged" dataset.
hourlyCalories_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "hourlySteps_merged" dataset.
hourlySteps_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "hourlyIntensities_merged" dataset.
hourlyIntensities_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "weightLogInfo_merged" dataset.
weightLogInfo_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "sleepDay_merged" dataset.
sleepDay_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sleepDay_merged)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
# Import the "heartrate_seconds_merged" dataset.
heartrate_seconds_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before analyzing the data, we have to make sure it is clean. The following sections show the steps used to clean each dataset.
For the “dailyActivity_merged” dataset, the cleaning steps are:
1. Convert the ActivityDate column to YYYY-MM-DD (date only), and Id to character type.
2. Rename the column “ActivityDate” to “Date”.
3. Check for duplicates.
4. Check for missing values.
# Convert ActivityDate to YYYY-MM-DD (date only), and convert Id to character.
dailyActivity <- dailyActivity_merged %>%
  mutate(ActivityDate = as.Date(ActivityDate, "%m/%d/%Y"),
         Id = as.character(Id))
# Rename the column "ActivityDate" to "Date".
colnames(dailyActivity)[colnames(dailyActivity) == "ActivityDate"] <- "Date"
# Remove duplicate rows.
dailyActivity <- unique(dailyActivity)
# Check for missing values.
sum(is.na(dailyActivity))
## [1] 0
head(dailyActivity)
## # A tibble: 6 × 15
## Id Date TotalSteps TotalDistance TrackerDistance
## <chr> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
For the “hourlyCalories_merged” dataset, the cleaning steps are:
1. Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character type.
2. Check for duplicates.
3. Check for missing values.
# Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character.
hourlyCalories <- hourlyCalories_merged %>%
  mutate(ActivityHour = as_datetime(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"),
         Id = as.character(Id))
# Remove duplicate rows.
hourlyCalories <- unique(hourlyCalories)
# Check for missing values.
sum(is.na(hourlyCalories))
## [1] 0
head(hourlyCalories)
## # A tibble: 6 × 3
## Id ActivityHour Calories
## <chr> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 81
## 2 1503960366 2016-04-12 01:00:00 61
## 3 1503960366 2016-04-12 02:00:00 59
## 4 1503960366 2016-04-12 03:00:00 47
## 5 1503960366 2016-04-12 04:00:00 48
## 6 1503960366 2016-04-12 05:00:00 48
For the “hourlySteps_merged” dataset, the cleaning steps are:
1. Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character type.
2. Rename the column “StepTotal” to “Steps”.
3. Check for duplicates.
4. Check for missing values.
# Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character.
hourlySteps <- hourlySteps_merged %>%
  mutate(ActivityHour = as_datetime(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"),
         Id = as.character(Id))
# Rename the column "StepTotal" to "Steps".
colnames(hourlySteps)[colnames(hourlySteps) == "StepTotal"] <- "Steps"
# Remove duplicate rows.
hourlySteps <- unique(hourlySteps)
# Check for missing values.
sum(is.na(hourlySteps))
## [1] 0
head(hourlySteps)
## # A tibble: 6 × 3
## Id ActivityHour Steps
## <chr> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
For the “hourlyIntensities_merged” dataset, the cleaning steps are:
1. Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character type.
2. Check for duplicates.
3. Check for missing values.
# Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character.
hourlyIntensities <- hourlyIntensities_merged %>%
  mutate(ActivityHour = as_datetime(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"),
         Id = as.character(Id))
# Remove duplicate rows.
hourlyIntensities <- unique(hourlyIntensities)
# Check for missing values.
sum(is.na(hourlyIntensities))
## [1] 0
head(hourlyIntensities)
## # A tibble: 6 × 4
## Id ActivityHour TotalIntensity AverageIntensity
## <chr> <dttm> <dbl> <dbl>
## 1 1503960366 2016-04-12 00:00:00 20 0.333
## 2 1503960366 2016-04-12 01:00:00 8 0.133
## 3 1503960366 2016-04-12 02:00:00 7 0.117
## 4 1503960366 2016-04-12 03:00:00 0 0
## 5 1503960366 2016-04-12 04:00:00 0 0
## 6 1503960366 2016-04-12 05:00:00 0 0
For the “sleepDay_merged” dataset, the cleaning steps are:
1. Convert the SleepDay column to YYYY-MM-DD, and Id to character type.
2. Rename the column “SleepDay” to “Date”.
3. Check for duplicates.
4. Check for missing values.
# Convert the SleepDay column to YYYY-MM-DD (date only), rename it to Date, and convert Id to character.
sleepDay <- sleepDay_merged %>%
  mutate(
    Date = as_date(as_datetime(SleepDay, format = "%m/%d/%Y %I:%M:%S %p")),
    Id = as.character(Id)
  ) %>%
  select(-SleepDay) %>%
  # Remove duplicate rows.
  distinct()
# Check for missing values.
sum(is.na(sleepDay))
## [1] 0
head(sleepDay)
## # A tibble: 6 × 5
## Id TotalSleepRecords TotalMinutesAsleep TotalTimeInBed Date
## <chr> <dbl> <dbl> <dbl> <date>
## 1 1503960366 1 327 346 2016-04-12
## 2 1503960366 2 384 407 2016-04-13
## 3 1503960366 1 412 442 2016-04-15
## 4 1503960366 2 340 367 2016-04-16
## 5 1503960366 1 700 712 2016-04-17
## 6 1503960366 1 304 320 2016-04-19
For the “heartrate_seconds_merged” dataset, I use the following cleaning steps:
1. Convert the Time column to YYYY-MM-DD HH:MM:SS, and Id to character type.
2. Group by Id, Date, and Hour and average the Value column to get AvgHourlyHeartRate.
3. Remove any records with biologically implausible AvgHourlyHeartRate values (< 40 or > 220 bpm), though the device generally filters these already.
4. Check for missing values.
# Convert the Time column to YYYY-MM-DD HH:MM:SS, and Id to character.
heartrate_seconds <- heartrate_seconds_merged %>%
  mutate(Time = as_datetime(Time, format = "%m/%d/%Y %I:%M:%S %p"),
         Id = as.character(Id))
head(heartrate_seconds)
## # A tibble: 6 × 3
## Id Time Value
## <chr> <dttm> <dbl>
## 1 2022484408 2016-04-12 07:21:00 97
## 2 2022484408 2016-04-12 07:21:05 102
## 3 2022484408 2016-04-12 07:21:10 105
## 4 2022484408 2016-04-12 07:21:20 103
## 5 2022484408 2016-04-12 07:21:25 101
## 6 2022484408 2016-04-12 07:22:05 95
For the “weightLogInfo_merged.csv” dataset, I use the following cleaning steps:
1. Convert the Date column (which contains a time) to date only (YYYY-MM-DD).
2. Check for missing values in WeightKg, BMI, and Fat. Since the sample contains only 8 users, decide whether to impute or simply exclude rows with missing data.
3. If a user logged weight more than once on the same date, take the last entry of that day as the most current.
# Convert the Date column (which contains a time) to a datetime, extract the Hour (0-23), and keep a date-only column.
weightLogInfo <- weightLogInfo_merged %>%
  mutate(Date = as_datetime(Date, format = "%m/%d/%Y %I:%M:%S %p"),
         Hour = lubridate::hour(Date),
         Dateonly = as_date(Date),
         Id = as.character(Id))
#Check for missing values in WeightKg, BMI, and Fat.
cat("Null in Weight", sum(is.na(weightLogInfo$WeightKg)),"\n")
## Null in Weight 0
cat("Null in BMI", sum(is.na(weightLogInfo$BMI)),"\n")
## Null in BMI 0
cat("Null in Fat", sum(is.na(weightLogInfo$Fat)),"\n")
## Null in Fat 65
# If a user logged weight more than once on the same date, take the last entry of that day as the most current.
weight_daily <- weightLogInfo %>%
  group_by(Id, Dateonly) %>%
  slice_max(order_by = Date, n = 1, with_ties = FALSE) %>%
  ungroup()
head(weightLogInfo)
## # A tibble: 6 × 10
## Id Date WeightKg WeightPounds Fat BMI IsManualReport
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 15039603… 2016-05-02 23:59:59 52.6 116. 22 22.6 TRUE
## 2 15039603… 2016-05-03 23:59:59 52.6 116. NA 22.6 TRUE
## 3 19279722… 2016-04-13 01:08:52 134. 294. NA 47.5 FALSE
## 4 28732127… 2016-04-21 23:59:59 56.7 125. NA 21.5 TRUE
## 5 28732127… 2016-05-12 23:59:59 57.3 126. NA 21.7 TRUE
## 6 43197035… 2016-04-17 23:59:59 72.4 160. 25 27.5 TRUE
## # ℹ 3 more variables: LogId <dbl>, Hour <int>, Dateonly <date>
dim(weightLogInfo)
## [1] 67 10
head(weight_daily)
## # A tibble: 6 × 10
## Id Date WeightKg WeightPounds Fat BMI IsManualReport
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 15039603… 2016-05-02 23:59:59 52.6 116. 22 22.6 TRUE
## 2 15039603… 2016-05-03 23:59:59 52.6 116. NA 22.6 TRUE
## 3 19279722… 2016-04-13 01:08:52 134. 294. NA 47.5 FALSE
## 4 28732127… 2016-04-21 23:59:59 56.7 125. NA 21.5 TRUE
## 5 28732127… 2016-05-12 23:59:59 57.3 126. NA 21.7 TRUE
## 6 43197035… 2016-04-17 23:59:59 72.4 160. 25 27.5 TRUE
## # ℹ 3 more variables: LogId <dbl>, Hour <int>, Dateonly <date>
dim(weight_daily)
## [1] 67 10
From the results:
1. The Fat column contains many NA values. I keep these rows since they still carry other useful information (BMI, weight) for analysis.
2. There is no duplicate weight data for the same user on the same date.
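The no-duplicates claim can be double-checked directly. This is a quick sanity check using the cleaned weightLogInfo from above; an empty result means no user has more than one weigh-in on the same date.

```r
# Count (Id, Dateonly) pairs that appear more than once.
weightLogInfo %>%
  count(Id, Dateonly) %>%
  filter(n > 1)
```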
I create three master tables for analysis.

| Table name | Action | Why | Feature Engineering |
|---|---|---|---|
| Daily_Holistic | Merge dailyActivity_merged and sleepDay_merged on Id and Date. | To correlate activity (steps) with recovery (sleep). | 1. DayOfWeek: extract the day of the week (e.g., “Monday”) from Date to analyze weekly routines and trends. 2. RestlessMinutes: TotalTimeInBed - TotalMinutesAsleep, as a sleep-efficiency measure. |
| Hourly_Rhythm | inner_join the three hourly files (Steps, Calories, Intensities) on Id and ActivityHour. | Steps alone don’t tell the whole story; high intensity with low steps might indicate weightlifting or yoga. | Extract the Hour (0-23) into a separate column for aggregation. |
| Health_Metrics | 1. Group by Id, Date, and Hour and average the Value column to get AvgHourlyHeartRate. 2. Remove records with biologically implausible values (< 40 or > 220 bpm). 3. Check for missing values. | Second-level data is too noisy; hourly averages allow heart-rate trends to be overlaid on activity peaks. | |
Daily_Holistic <- left_join(dailyActivity, sleepDay, by = c("Id", "Date")) %>%
  mutate(
    # Day of the week (Monday-Sunday) for weekly-routine analysis.
    DayOWeek = wday(Date, label = TRUE, abbr = FALSE),
    # Total minutes across all activity levels.
    TotalActiveMinutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes,
    # Time spent awake in bed.
    RestlessMinutes = TotalTimeInBed - TotalMinutesAsleep
  )
cat("Original numbers:", nrow(Daily_Holistic), "\n")
## Original numbers: 940
cat("Null numbers in RestlessMinutes", sum(is.na(Daily_Holistic$RestlessMinutes)),"\n")
## Null numbers in RestlessMinutes 530
print(Daily_Holistic %>% filter(!is.na(RestlessMinutes)) %>% head())
## # A tibble: 6 × 21
## Id Date TotalSteps TotalDistance TrackerDistance
## <chr> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-15 9762 6.28 6.28
## 4 1503960366 2016-04-16 12669 8.16 8.16
## 5 1503960366 2016-04-17 9705 6.48 6.48
## 6 1503960366 2016-04-19 15506 9.88 9.88
## # ℹ 16 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## # TotalSleepRecords <dbl>, TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>,
## # DayOWeek <ord>, TotalActiveMinutes <dbl>, RestlessMinutes <dbl>
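Alongside RestlessMinutes, sleep efficiency can also be expressed as a ratio with one more mutate. This is a sketch: SleepEfficiency is a hypothetical column name not used elsewhere in this report, and days without a sleep record simply stay NA.

```r
# Sleep efficiency as the share of time in bed actually spent asleep (%).
Daily_Holistic <- Daily_Holistic %>%
  mutate(SleepEfficiency = TotalMinutesAsleep / TotalTimeInBed * 100)
```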
Hourly_Rhythm <- hourlyCalories %>%
inner_join(hourlyIntensities, by = c("Id", "ActivityHour")) %>%
inner_join(hourlySteps, by = c("Id", "ActivityHour")) %>%
mutate(
Date = as_date(ActivityHour),
Hour = lubridate::hour(ActivityHour)
)
head(Hourly_Rhythm)
## # A tibble: 6 × 8
## Id ActivityHour Calories TotalIntensity AverageIntensity Steps
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 00:00:00 81 20 0.333 373
## 2 1503960366 2016-04-12 01:00:00 61 8 0.133 160
## 3 1503960366 2016-04-12 02:00:00 59 7 0.117 151
## 4 1503960366 2016-04-12 03:00:00 47 0 0 0
## 5 1503960366 2016-04-12 04:00:00 48 0 0 0
## 6 1503960366 2016-04-12 05:00:00 48 0 0 0
## # ℹ 2 more variables: Date <date>, Hour <int>
# Group by Id, Date, and Hour and average the Value column to get AvgHourlyHeartRate.
heartrate_seconds <- heartrate_seconds %>%
mutate(Hour = lubridate::hour(Time),
Date = as_date(Time))
heartrate_avg <- heartrate_seconds %>%
group_by(Id, Date, Hour)%>%
summarise(
AvgHourlyHeartRate = mean(Value),
.groups = 'drop'
)
# Remove records with biologically implausible averages (< 40 or > 220 bpm);
# the device generally filters these already.
Health_Metrics <- heartrate_avg %>%
  filter(
    AvgHourlyHeartRate >= 40,   # drop averages below 40 bpm
    AvgHourlyHeartRate <= 220   # drop averages above 220 bpm
  )
records_removed <- nrow(heartrate_avg) - nrow(Health_Metrics)
cat("Original numbers:", nrow(heartrate_avg), "\n")
## Original numbers: 6013
cat("Remove numbers:", records_removed, "\n")
## Remove numbers: 0
cat("After Filter:", nrow(Health_Metrics), "\n")
## After Filter: 6013
#Check if there are any missing value.
sum(is.na(Health_Metrics))
## [1] 0
head(Health_Metrics)
## # A tibble: 6 × 4
## Id Date Hour AvgHourlyHeartRate
## <chr> <date> <int> <dbl>
## 1 2022484408 2016-04-12 7 83.2
## 2 2022484408 2016-04-12 8 68.6
## 3 2022484408 2016-04-12 9 66.4
## 4 2022484408 2016-04-12 10 107.
## 5 2022484408 2016-04-12 11 67.8
## 6 2022484408 2016-04-12 12 66.2
# Keep the last weigh-in per user per day.
weight_clean <- weightLogInfo %>%
  select(Id, Date, Dateonly, WeightKg, BMI, IsManualReport) %>%
  group_by(Id, Dateonly) %>%
  slice_max(order_by = Date, n = 1, with_ties = FALSE) %>%
  ungroup()
Hourly_Rhythm <- full_join(
Hourly_Rhythm,
Health_Metrics,
by = c("Id", "Date", "Hour")
)
dim(Hourly_Rhythm)
## [1] 22106 9
sum(is.na(Hourly_Rhythm$AvgHourlyHeartRate))
## [1] 16093
print(Hourly_Rhythm %>% filter(!is.na(AvgHourlyHeartRate)) %>% head())
## # A tibble: 6 × 9
## Id ActivityHour Calories TotalIntensity AverageIntensity Steps
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2022484408 2016-04-12 07:00:00 136 28 0.467 847
## 2 2022484408 2016-04-12 08:00:00 99 13 0.217 334
## 3 2022484408 2016-04-12 09:00:00 90 13 0.217 243
## 4 2022484408 2016-04-12 10:00:00 369 143 2.38 5243
## 5 2022484408 2016-04-12 11:00:00 99 20 0.333 323
## 6 2022484408 2016-04-12 12:00:00 86 11 0.183 184
## # ℹ 3 more variables: Date <date>, Hour <int>, AvgHourlyHeartRate <dbl>
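Before exporting, the overlay described in the master-table plan (heart-rate trends on top of activity peaks) can be previewed directly in R. This is a sketch: hourly_profile is a hypothetical intermediate, and the heart-rate average skips the many hours without heart-rate coverage.

```r
# Average steps and heart rate by hour of day across all users and dates.
hourly_profile <- Hourly_Rhythm %>%
  group_by(Hour) %>%
  summarise(
    AvgSteps = mean(Steps, na.rm = TRUE),
    AvgHeartRate = mean(AvgHourlyHeartRate, na.rm = TRUE),
    .groups = "drop"
  )

head(hourly_profile)
```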
Export the “Daily_Holistic” and “Hourly_Rhythm” datasets, then build the charts in Tableau.
write_csv(Daily_Holistic, "Daily_Holistic_Cleaned.csv")
write_csv(Hourly_Rhythm, "Hourly_Rhythm_Cleaned.csv")
Key findings and marketing recommendations are consolidated in the project README.