Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
The goal of this case study is to analyze smart device usage data to unlock new growth opportunities for Bellabeat.
Question: How can a wellness technology company play it smart?
Answer: Identify and analyze consumer usage habits from the provided fitness data to determine actionable insights that can inform Bellabeat’s marketing strategy and drive product development.
Business Task: Analyze multi-dimensional smart device data to identify gaps in consumer wellness habits. Use these insights to position Bellabeat as a holistic lifestyle brand (active + restorative) rather than just a fitness tracker.
I use public data that explores smart device users’ daily habits, available on Kaggle: FitBit Fitness Tracker Data.
This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
The data is from a public third-party fitness tracker, not Bellabeat’s users, which introduces potential bias and limits generalizability to Bellabeat’s specific, female-focused target audience.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidyr)
library(readr)
library(dplyr)
library(lubridate)
| Dataset Group | Files Selected | Purpose |
|---|---|---|
| A. Macro Habits | dailyActivity_merged.csv | The master table for steps, distance, and calories. |
| B. Micro Habits | hourlySteps.csv, hourlyCalories.csv, hourlyIntensities.csv | Critical for identifying when users are active (Time-of-Day analysis). |
| C. Wellness & Health | sleepDay_merged.csv, heartrate_seconds.csv | To analyze recovery, stress, and sleep quality. |
| D. User Engagement | weightLogInfo_merged.csv | To analyze feature adoption rates (manual vs. auto-logging). |
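All of the selected CSVs live in the same export folder, so the repeated long paths in the imports below can optionally be collapsed into a small helper. This is a sketch: `data_dir` and `read_fitabase` are hypothetical names, and the path must be adjusted to your local copy of the Fitabase export.

```r
# Base directory of the unzipped Fitabase export (adjust to your machine).
data_dir <- "Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16"

# Read any Fitabase CSV by file name, quieting the column-type message.
read_fitabase <- function(file) {
  readr::read_csv(file.path(data_dir, file), show_col_types = FALSE)
}

# Example:
# dailyActivity_merged <- read_fitabase("dailyActivity_merged.csv")
```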
# Import the "dailyActivity_merged" dataset.
dailyActivity_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "hourlyCalories_merged" dataset.
hourlyCalories_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "hourlySteps_merged" dataset.
hourlySteps_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "hourlyIntensities_merged" dataset.
hourlyIntensities_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "weightLogInfo_merged" dataset.
weightLogInfo_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the "sleepDay_merged" dataset.
sleepDay_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sleepDay_merged)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
# Import the "heartrate_seconds_merged" dataset.
heartrate_seconds_merged <- read_csv("Documents/2025_Coursera_DataAnalyst/FitBit Fitness Tracker Data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before analyzing the data, we have to make sure it is clean. The following sections show the steps used to clean each dataset.
For the “dailyActivity_merged” dataset, the cleaning steps are:
1. Convert the ActivityDate column to YYYY-MM-DD (date only), and Id to character type.
2. Rename the column “ActivityDate” to “Date”.
3. Check for duplicates.
4. Check for missing values.
# Convert ActivityDate to YYYY-MM-DD (date only), and convert Id to character.
dailyActivity <- dailyActivity_merged %>%
  mutate(ActivityDate = as.Date(ActivityDate, "%m/%d/%Y"),
         Id = as.character(Id))
# Rename the column "ActivityDate" to "Date".
colnames(dailyActivity)[colnames(dailyActivity) == "ActivityDate"] <- "Date"
# Remove duplicate rows.
dailyActivity <- unique(dailyActivity)
# Check for missing values.
sum(is.na(dailyActivity))
## [1] 0
head(dailyActivity)
## # A tibble: 6 × 15
## Id Date TotalSteps TotalDistance TrackerDistance
## <chr> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
For the “hourlyCalories_merged” dataset, the cleaning steps are:
1. Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character type.
2. Check for duplicates.
3. Check for missing values.
# Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character.
hourlyCalories <- hourlyCalories_merged %>%
  mutate(ActivityHour = as_datetime(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"),
         Id = as.character(Id))
# Remove duplicate rows.
hourlyCalories <- unique(hourlyCalories)
# Check for missing values.
sum(is.na(hourlyCalories))
## [1] 0
head(hourlyCalories)
## # A tibble: 6 × 3
## Id ActivityHour Calories
## <chr> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 81
## 2 1503960366 2016-04-12 01:00:00 61
## 3 1503960366 2016-04-12 02:00:00 59
## 4 1503960366 2016-04-12 03:00:00 47
## 5 1503960366 2016-04-12 04:00:00 48
## 6 1503960366 2016-04-12 05:00:00 48
For the “hourlySteps_merged” dataset, the cleaning steps are:
1. Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character type.
2. Rename the column “StepTotal” to “Steps”.
3. Check for duplicates.
4. Check for missing values.
# Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character.
hourlySteps <- hourlySteps_merged %>%
  mutate(ActivityHour = as_datetime(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"),
         Id = as.character(Id))
# Rename the column "StepTotal" to "Steps".
colnames(hourlySteps)[colnames(hourlySteps) == "StepTotal"] <- "Steps"
# Remove duplicate rows.
hourlySteps <- unique(hourlySteps)
# Check for missing values.
sum(is.na(hourlySteps))
## [1] 0
head(hourlySteps)
## # A tibble: 6 × 3
## Id ActivityHour Steps
## <chr> <dttm> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
For the “hourlyIntensities_merged” dataset, the cleaning steps are:
1. Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character type.
2. Check for duplicates.
3. Check for missing values.
# Convert the ActivityHour column to YYYY-MM-DD HH:MM:SS, and Id to character.
hourlyIntensities <- hourlyIntensities_merged %>%
  mutate(ActivityHour = as_datetime(ActivityHour, format = "%m/%d/%Y %I:%M:%S %p"),
         Id = as.character(Id))
# Remove duplicate rows.
hourlyIntensities <- unique(hourlyIntensities)
# Check for missing values.
sum(is.na(hourlyIntensities))
## [1] 0
head(hourlyIntensities)
## # A tibble: 6 × 4
## Id ActivityHour TotalIntensity AverageIntensity
## <chr> <dttm> <dbl> <dbl>
## 1 1503960366 2016-04-12 00:00:00 20 0.333
## 2 1503960366 2016-04-12 01:00:00 8 0.133
## 3 1503960366 2016-04-12 02:00:00 7 0.117
## 4 1503960366 2016-04-12 03:00:00 0 0
## 5 1503960366 2016-04-12 04:00:00 0 0
## 6 1503960366 2016-04-12 05:00:00 0 0
For the “sleepDay_merged” dataset, the cleaning steps are:
1. Convert the SleepDay column to YYYY-MM-DD, and Id to character type.
2. Rename the column “SleepDay” to “Date”.
3. Check for duplicates.
4. Check for missing values.
# Convert the SleepDay column to YYYY-MM-DD (date only), rename it to Date, and convert Id to character.
sleepDay <- sleepDay_merged %>%
  mutate(
    Date = as_date(as_datetime(SleepDay, format = "%m/%d/%Y %I:%M:%S %p")),
    Id = as.character(Id)
  ) %>%
  select(-SleepDay) %>%
  # Remove duplicate rows.
  distinct()
# Check for missing values.
sum(is.na(sleepDay))
## [1] 0
head(sleepDay)
## # A tibble: 6 × 5
## Id TotalSleepRecords TotalMinutesAsleep TotalTimeInBed Date
## <chr> <dbl> <dbl> <dbl> <date>
## 1 1503960366 1 327 346 2016-04-12
## 2 1503960366 2 384 407 2016-04-13
## 3 1503960366 1 412 442 2016-04-15
## 4 1503960366 2 340 367 2016-04-16
## 5 1503960366 1 700 712 2016-04-17
## 6 1503960366 1 304 320 2016-04-19
For the “heartrate_seconds_merged” dataset, I use the following cleaning steps:
1. Convert the Time column to YYYY-MM-DD HH:MM:SS, and Id to character type.
2. Group by Id, Date, and Hour and average the Value column to get AvgHourlyHeartRate.
3. Remove any records with biologically implausible AvgHourlyHeartRate values (< 40 or > 220 bpm), though the device generally filters these already.
4. Check for missing values.
# Convert the Time column to YYYY-MM-DD HH:MM:SS, and Id to character.
heartrate_seconds <- heartrate_seconds_merged %>%
  mutate(Time = as_datetime(Time, format = "%m/%d/%Y %I:%M:%S %p"),
         Id = as.character(Id))
head(heartrate_seconds)
## # A tibble: 6 × 3
## Id Time Value
## <chr> <dttm> <dbl>
## 1 2022484408 2016-04-12 07:21:00 97
## 2 2022484408 2016-04-12 07:21:05 102
## 3 2022484408 2016-04-12 07:21:10 105
## 4 2022484408 2016-04-12 07:21:20 103
## 5 2022484408 2016-04-12 07:21:25 101
## 6 2022484408 2016-04-12 07:22:05 95
For the “weightLogInfo_merged.csv” dataset, I use the following cleaning steps:
1. Convert the Date column (which contains a time) to date only (YYYY-MM-DD).
2. Check for missing values in WeightKg, BMI, and Fat. Since the sample contains only 8 users, decide whether to impute or simply exclude rows with missing data.
3. If a user logged weight more than once on the same date, take the last entry of that day as the most current.
# Convert the Date column (which contains a time) to a datetime, extract the Hour (0-23), and keep a date-only column.
weightLogInfo <- weightLogInfo_merged %>%
  mutate(Date = as_datetime(Date, format = "%m/%d/%Y %I:%M:%S %p"),
         Hour = lubridate::hour(Date),
         Dateonly = as_date(Date),
         Id = as.character(Id))
#Check for missing values in WeightKg, BMI, and Fat.
cat("Null in Weight", sum(is.na(weightLogInfo$WeightKg)),"\n")
## Null in Weight 0
cat("Null in BMI", sum(is.na(weightLogInfo$BMI)),"\n")
## Null in BMI 0
cat("Null in Fat", sum(is.na(weightLogInfo$Fat)),"\n")
## Null in Fat 65
# If a user logged weight more than once on the same date, take the last entry of that day as the most current.
weight_daily <- weightLogInfo %>%
  group_by(Id, Dateonly) %>%
  slice_max(order_by = Date, n = 1, with_ties = FALSE) %>%
  ungroup()
head(weightLogInfo)
## # A tibble: 6 × 10
## Id Date WeightKg WeightPounds Fat BMI IsManualReport
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 15039603… 2016-05-02 23:59:59 52.6 116. 22 22.6 TRUE
## 2 15039603… 2016-05-03 23:59:59 52.6 116. NA 22.6 TRUE
## 3 19279722… 2016-04-13 01:08:52 134. 294. NA 47.5 FALSE
## 4 28732127… 2016-04-21 23:59:59 56.7 125. NA 21.5 TRUE
## 5 28732127… 2016-05-12 23:59:59 57.3 126. NA 21.7 TRUE
## 6 43197035… 2016-04-17 23:59:59 72.4 160. 25 27.5 TRUE
## # ℹ 3 more variables: LogId <dbl>, Hour <int>, Dateonly <date>
dim(weightLogInfo)
## [1] 67 10
head(weight_daily)
## # A tibble: 6 × 10
## Id Date WeightKg WeightPounds Fat BMI IsManualReport
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 15039603… 2016-05-02 23:59:59 52.6 116. 22 22.6 TRUE
## 2 15039603… 2016-05-03 23:59:59 52.6 116. NA 22.6 TRUE
## 3 19279722… 2016-04-13 01:08:52 134. 294. NA 47.5 FALSE
## 4 28732127… 2016-04-21 23:59:59 56.7 125. NA 21.5 TRUE
## 5 28732127… 2016-05-12 23:59:59 57.3 126. NA 21.7 TRUE
## 6 43197035… 2016-04-17 23:59:59 72.4 160. 25 27.5 TRUE
## # ℹ 3 more variables: LogId <dbl>, Hour <int>, Dateonly <date>
dim(weight_daily)
## [1] 67 10
From the results:
1. The Fat column contains many NA values. I keep these rows since they still carry other useful information (BMI, weight) for analysis.
2. There is no duplicate weight data for the same user on the same date.
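The no-duplicates claim can be double-checked directly. This is a quick sanity check using the cleaned weightLogInfo from above; an empty result means no user has more than one weigh-in on the same date.

```r
# Count (Id, Dateonly) pairs that appear more than once.
weightLogInfo %>%
  count(Id, Dateonly) %>%
  filter(n > 1)
```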
I create three master tables for analysis.

| Table name | Action | Why | Feature Engineering |
|---|---|---|---|
| Daily_Holistic | Merge dailyActivity_merged and sleepDay_merged on Id and Date. | To correlate activity (steps) with recovery (sleep). | 1. DayOfWeek: extract the day of the week (e.g., “Monday”) from Date to analyze weekly routines and trends. 2. RestlessMinutes: TotalTimeInBed - TotalMinutesAsleep, as a sleep-efficiency measure. |
| Hourly_Rhythm | inner_join the three hourly files (Steps, Calories, Intensities) on Id and ActivityHour. | Steps alone don’t tell the whole story; high intensity with low steps might indicate weightlifting or yoga. | Extract the Hour (0-23) into a separate column for aggregation. |
| Health_Metrics | 1. Group by Id, Date, and Hour and average the Value column to get AvgHourlyHeartRate. 2. Remove records with biologically implausible values (< 40 or > 220 bpm). 3. Check for missing values. | Second-level data is too noisy; hourly averages allow heart-rate trends to be overlaid on activity peaks. | |
Daily_Holistic <- left_join(dailyActivity, sleepDay, by = c("Id", "Date")) %>%
  mutate(
    # Day of the week (Monday-Sunday) for weekly-routine analysis.
    DayOWeek = wday(Date, label = TRUE, abbr = FALSE),
    # Total minutes across all activity levels.
    TotalActiveMinutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes,
    # Time spent awake in bed.
    RestlessMinutes = TotalTimeInBed - TotalMinutesAsleep
  )
cat("Original numbers:", nrow(Daily_Holistic), "\n")
## Original numbers: 940
cat("Null numbers in RestlessMinutes", sum(is.na(Daily_Holistic$RestlessMinutes)),"\n")
## Null numbers in RestlessMinutes 530
print(Daily_Holistic %>% filter(!is.na(RestlessMinutes)) %>% head())
## # A tibble: 6 × 21
## Id Date TotalSteps TotalDistance TrackerDistance
## <chr> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-15 9762 6.28 6.28
## 4 1503960366 2016-04-16 12669 8.16 8.16
## 5 1503960366 2016-04-17 9705 6.48 6.48
## 6 1503960366 2016-04-19 15506 9.88 9.88
## # ℹ 16 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## # TotalSleepRecords <dbl>, TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>,
## # DayOWeek <ord>, TotalActiveMinutes <dbl>, RestlessMinutes <dbl>
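Alongside RestlessMinutes, sleep efficiency can also be expressed as a ratio with one more mutate. This is a sketch: SleepEfficiency is a hypothetical column name not used elsewhere in this report, and days without a sleep record simply stay NA.

```r
# Sleep efficiency as the share of time in bed actually spent asleep (%).
Daily_Holistic <- Daily_Holistic %>%
  mutate(SleepEfficiency = TotalMinutesAsleep / TotalTimeInBed * 100)
```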
Hourly_Rhythm <- hourlyCalories %>%
inner_join(hourlyIntensities, by = c("Id", "ActivityHour")) %>%
inner_join(hourlySteps, by = c("Id", "ActivityHour")) %>%
mutate(
Date = as_date(ActivityHour),
Hour = lubridate::hour(ActivityHour)
)
head(Hourly_Rhythm)
## # A tibble: 6 × 8
## Id ActivityHour Calories TotalIntensity AverageIntensity Steps
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 00:00:00 81 20 0.333 373
## 2 1503960366 2016-04-12 01:00:00 61 8 0.133 160
## 3 1503960366 2016-04-12 02:00:00 59 7 0.117 151
## 4 1503960366 2016-04-12 03:00:00 47 0 0 0
## 5 1503960366 2016-04-12 04:00:00 48 0 0 0
## 6 1503960366 2016-04-12 05:00:00 48 0 0 0
## # ℹ 2 more variables: Date <date>, Hour <int>
# Group by Id, Date, and Hour and average the Value column to get AvgHourlyHeartRate.
heartrate_seconds <- heartrate_seconds %>%
mutate(Hour = lubridate::hour(Time),
Date = as_date(Time))
heartrate_avg <- heartrate_seconds %>%
group_by(Id, Date, Hour)%>%
summarise(
AvgHourlyHeartRate = mean(Value),
.groups = 'drop'
)
# Remove records with biologically implausible averages (< 40 or > 220 bpm);
# the device generally filters these already.
Health_Metrics <- heartrate_avg %>%
  filter(
    AvgHourlyHeartRate >= 40,   # drop averages below 40 bpm
    AvgHourlyHeartRate <= 220   # drop averages above 220 bpm
  )
records_removed <- nrow(heartrate_avg) - nrow(Health_Metrics)
cat("Original numbers:", nrow(heartrate_avg), "\n")
## Original numbers: 6013
cat("Remove numbers:", records_removed, "\n")
## Remove numbers: 0
cat("After Filter:", nrow(Health_Metrics), "\n")
## After Filter: 6013
#Check if there are any missing value.
sum(is.na(Health_Metrics))
## [1] 0
head(Health_Metrics)
## # A tibble: 6 × 4
## Id Date Hour AvgHourlyHeartRate
## <chr> <date> <int> <dbl>
## 1 2022484408 2016-04-12 7 83.2
## 2 2022484408 2016-04-12 8 68.6
## 3 2022484408 2016-04-12 9 66.4
## 4 2022484408 2016-04-12 10 107.
## 5 2022484408 2016-04-12 11 67.8
## 6 2022484408 2016-04-12 12 66.2
# Keep the last weigh-in per user per day.
weight_clean <- weightLogInfo %>%
  select(Id, Date, Dateonly, WeightKg, BMI, IsManualReport) %>%
  group_by(Id, Dateonly) %>%
  slice_max(order_by = Date, n = 1, with_ties = FALSE) %>%
  ungroup()
Hourly_Rhythm <- full_join(
Hourly_Rhythm,
Health_Metrics,
by = c("Id", "Date", "Hour")
)
dim(Hourly_Rhythm)
## [1] 22106 9
sum(is.na(Hourly_Rhythm$AvgHourlyHeartRate))
## [1] 16093
print(Hourly_Rhythm %>% filter(!is.na(AvgHourlyHeartRate)) %>% head())
## # A tibble: 6 × 9
## Id ActivityHour Calories TotalIntensity AverageIntensity Steps
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2022484408 2016-04-12 07:00:00 136 28 0.467 847
## 2 2022484408 2016-04-12 08:00:00 99 13 0.217 334
## 3 2022484408 2016-04-12 09:00:00 90 13 0.217 243
## 4 2022484408 2016-04-12 10:00:00 369 143 2.38 5243
## 5 2022484408 2016-04-12 11:00:00 99 20 0.333 323
## 6 2022484408 2016-04-12 12:00:00 86 11 0.183 184
## # ℹ 3 more variables: Date <date>, Hour <int>, AvgHourlyHeartRate <dbl>
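Before exporting, the overlay described in the master-table plan (heart-rate trends on top of activity peaks) can be previewed directly in R. This is a sketch: hourly_profile is a hypothetical intermediate, and the heart-rate average skips the many hours without heart-rate coverage.

```r
# Average steps and heart rate by hour of day across all users and dates.
hourly_profile <- Hourly_Rhythm %>%
  group_by(Hour) %>%
  summarise(
    AvgSteps = mean(Steps, na.rm = TRUE),
    AvgHeartRate = mean(AvgHourlyHeartRate, na.rm = TRUE),
    .groups = "drop"
  )

head(hourly_profile)
```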
Export the “Daily_Holistic” and “Hourly_Rhythm” datasets, then build the charts in Tableau.
write_csv(Daily_Holistic, "Daily_Holistic_Cleaned.csv")
write_csv(Hourly_Rhythm, "Hourly_Rhythm_Cleaned.csv")
Key findings and marketing recommendations are consolidated in the project README.