Grouping data files by file extension and time

I have data in which different files are available at different times, and each file has its own extension. I want to group the files by extension: files ending in .pdf, .pptx or .srt should go into a cluster named "WORK", files ending in .mp3, .mp4 or .jpg into a cluster named "Entertainment", and all remaining files into a cluster named "Other". I also have to split the time into two categories: "day" (6 am to 6 pm) and "night" (6 pm to 6 am). Finally, I need to make a graph (histogram) that shows which cluster has more files and how many files there are per day and night. I don't know how to start.

Name                                                               Time
$R1XFFF3.JPG                                        11/04/2017 20:39:17
[Fall 2016] Duty Roaster, Final Term (1).xlsx       21/03/2017 01:33:48
04_OOP_Base.sln                                     16/03/2017 22:26:15
1 - 2 - What is Machine Learning- (7 min).pdf       02/04/2017 02:03:18
1 - 3 - Supervised Learning (12 min).jpg            02/04/2017 02:03:20
1 - 4 - Unsupervised Learning (14 min).mkv          02/04/2017 02:03:21
1.jpg                                               08/04/2017 19:02:55
1.png                                               17/03/2017 11:17:19
15-oop.ppt                                          16/03/2017 22:28:58
2 - 1 - Model Representation (8 min).srt            02/04/2017 02:03:21
2 - 2 - Cost Function (8 min).srt                   02/04/2017 02:03:22
2 - 3 - Cost Function - Intuition I (11 min).srt    02/04/2017 02:03:23
2 - 4 - Cost Function - Intuition II (9 min).srt    02/04/2017 02:03:23
2 - 5 - Gradient Descent (11 min).ppt               02/04/2017 02:03:39

      



1 answer


You can use a regex for the categories and hour() from lubridate to check the time condition.

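library(lubridate)

# Case-insensitive match, so that .JPG is treated like .jpg
df$cluster = ifelse(grepl("\\.(pdf|pptx|srt)$", df$Name, perl = TRUE, ignore.case = TRUE), "WORK", "Other")
df$cluster = ifelse(grepl("\\.(mp3|mp4|jpg)$", df$Name, perl = TRUE, ignore.case = TRUE), "Entertainment", df$cluster)

# The timestamps are day/month/year (e.g. 21/03/2017), hence %d/%m/%Y
df$Time = as.POSIXct(strptime(df$Time, "%d/%m/%Y %H:%M:%S"))

# Day runs from 06:00 up to, but not including, 18:00
df$Time2 = ifelse(hour(df$Time) >= 6 & hour(df$Time) < 18, "Day", "Night")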
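For anyone who wants to try the classification without the original data, here is a minimal base-R sketch of the same idea. The data frame holds only a handful of rows copied from the question, and using tools::file_ext() with %in% instead of a regex is my own substitution, not part of the answer above. The "UTC" time zone is an assumption made only so that the extracted hours are reproducible.

```r
# Hypothetical sample: a few rows copied from the listing in the question.
df <- data.frame(
  Name = c("$R1XFFF3.JPG",
           "1 - 2 - What is Machine Learning- (7 min).pdf",
           "1.png",
           "15-oop.ppt",
           "2 - 1 - Model Representation (8 min).srt"),
  Time = c("11/04/2017 20:39:17",
           "02/04/2017 02:03:18",
           "17/03/2017 11:17:19",
           "16/03/2017 22:28:58",
           "02/04/2017 02:03:21"),
  stringsAsFactors = FALSE
)

# Lower-cased extension without the dot, e.g. "jpg" for "$R1XFFF3.JPG".
ext <- tolower(tools::file_ext(df$Name))

# Anything that is neither a WORK nor an Entertainment extension becomes "Other".
df$cluster <- ifelse(ext %in% c("pdf", "pptx", "srt"), "WORK",
              ifelse(ext %in% c("mp3", "mp4", "jpg"), "Entertainment", "Other"))

# Day/month/year timestamps; the fixed time zone is an assumption.
parsed <- as.POSIXct(df$Time, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
hr <- as.integer(format(parsed, "%H"))

# Day covers 06:00:00 through 17:59:59; everything else is Night.
df$Time2 <- ifelse(hr >= 6 & hr < 18, "Day", "Night")

print(df[, c("Name", "cluster", "Time2")])
```

Note that .ppt is not in the WORK list given in the question (only .pptx is), so "15-oop.ppt" lands in "Other" here; add "ppt" to the first vector if that is not what you want.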

      

To build the plot, you can split it into two separate plots: one for the night and one for the day.

barplot(table(df[df$Time2 == "Night", "cluster"]), main = "Night")
barplot(table(df[df$Time2 == "Day", "cluster"]), main = "Day")

      



We can have these side by side if we run this before the plotting calls:

par(mfrow = c(1, 2))
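Put together, the side-by-side layout might look like the sketch below. The small data frame is made up and simply stands in for an already classified df. Turning cluster into a factor with fixed levels is my own addition; it keeps empty clusters visible and the bars in the same order in both panels.

```r
# Hypothetical, already classified data standing in for df.
df <- data.frame(
  cluster = c("WORK", "Entertainment", "Other", "WORK", "Other", "Entertainment"),
  Time2   = c("Day", "Night", "Night", "Night", "Day", "Night"),
  stringsAsFactors = FALSE
)

# Fixed levels: a cluster with zero files still gets a (zero-height) bar.
df$cluster <- factor(df$cluster, levels = c("WORK", "Entertainment", "Other"))

day_counts   <- table(df$cluster[df$Time2 == "Day"])
night_counts <- table(df$cluster[df$Time2 == "Night"])

old_par <- par(mfrow = c(1, 2))   # one row, two panels
barplot(day_counts, main = "Day", ylab = "Number of files")
barplot(night_counts, main = "Night")
par(old_par)                      # restore the previous layout
```

Restoring the old par() settings at the end means later plots are not silently squeezed into the two-panel layout.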

      

Or put both on the same plot with ggplot2:

library(ggplot2)
p = ggplot(df, aes(cluster, fill = Time2))
p + geom_bar(position = "dodge")
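If you also need the actual numbers behind the bars (how many files per cluster for day and night, and which cluster is largest), a two-way table() is one way to get them. This is only a sketch on made-up, already classified data; the base-R barplot() call at the end is an alternative to the ggplot2 version, not a replacement for it.

```r
# Hypothetical, already classified data standing in for df.
df <- data.frame(
  cluster = c("WORK", "WORK", "Entertainment", "Other", "WORK", "Entertainment"),
  Time2   = c("Day", "Night", "Night", "Night", "Night", "Day"),
  stringsAsFactors = FALSE
)

# Rows are clusters, columns are Day/Night.
counts <- table(df$cluster, df$Time2)
print(counts)

# Total files per cluster, and the name of the largest one.
totals  <- rowSums(counts)
largest <- names(which.max(totals))
print(largest)

# Grouped bars in base R: transpose so that each cluster gets a Day and a Night bar.
barplot(t(counts), beside = TRUE, legend.text = TRUE, ylab = "Number of files")
```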

      
