regex - Text processing / regular expressions in R
I have a data frame with the following columns:
user_id: g17165fd2e0bba9a449857645bb6g3a9a7ef8e6c, time: 1361553741, url: a string URL.
The URL sometimes takes the form https://something.com/name/forum/thread?thread_id=51.
I want to create a data frame that tells me, for each user, how many times he or she visited each thread_id between time x and time y. So the number of observations equals the number of users, and the number of columns equals the number of thread ids + 1 (the total views).
The data set is big, so doing this in parallel is a must.
What is the best way of doing this in R?
Thanks a lot!
PS: @David created code that generates a data frame like the one I mentioned, and provided a perfect answer to the question.
set.seed(2)
# make junk data
dat <- data.frame(user = 1:5,
                  time = 1:20,
                  url = paste0("https://domain.com/forum/thread?thread_id=", sample(5, 20, TRUE)))
I'm pretty sure this will work for you:
> library(plyr)
> library(doMC)
> library(reshape2)
>
> set.seed(2)
> # make junk data
> dat <- data.frame(user = 1:5,
+                   time = 1:20,
+                   url = paste0("https://domain.com/forum/thread?thread_id=", sample(5, 20, TRUE)))
> head(dat)
  user time                                         url
1    1    1 https://domain.com/forum/thread?thread_id=1
2    2    2 https://domain.com/forum/thread?thread_id=4
3    3    3 https://domain.com/forum/thread?thread_id=3
4    4    4 https://domain.com/forum/thread?thread_id=1
5    5    5 https://domain.com/forum/thread?thread_id=5
6    1    6 https://domain.com/forum/thread?thread_id=5
> # subset within the time range
> dat <- dat[dat$time >= 1 & dat$time <= 20, ]
>
> # make a threadid variable by stripping everything up to "thread_id="
> dat$threadid <- gsub("^.*thread_id=", "", dat$url)
>
> # register parallel cores
> registerDoMC(4)
> # count the number of thread occurrences for each user (in parallel)
> dat.new <- ddply(dat, .(user, threadid), summarize, threadcount = length(threadid), .parallel = TRUE)
> # reshape the data into the format you want
> dat.new <- dcast(dat.new, user ~ threadid, value.var = "threadcount", fill = 0)
> # add total views
> dat.new$totalview <- rowSums(dat.new[, -1])
> dat.new
  user 1 2 3 4 5 totalview
1    1 1 0 1 0 2         4
2    2 1 1 0 1 1         4
3    3 0 1 1 1 1         4
4    4 2 0 2 0 0         4
5    5 1 0 2 0 1         4
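One thing to watch: the question says the URL only sometimes has the thread form, and the gsub() call above returns the whole URL unchanged when there is no thread_id in it. Below is a minimal base-R sketch that skips those rows and does the counting with table() instead of ddply(). The function name count_thread_views, its arguments t_min and t_max, and the sample rows are made up for illustration and are not part of the answer above. It assumes thread ids are numeric, and it runs on a single core, so it is not a drop-in replacement for the parallel version.

```r
# Hypothetical helper (not from the answer above): per-user view counts for
# each thread_id within a time window, ignoring URLs without a thread_id.
count_thread_views <- function(dat, t_min, t_max) {
  # Keep rows inside the requested time window.
  dat <- dat[dat$time >= t_min & dat$time <= t_max, , drop = FALSE]

  # Remember every user seen in the window so each one gets a row,
  # even if none of their URLs carry a thread_id.
  users <- sort(unique(dat$user))

  # Keep only URLs with a numeric thread_id query parameter (assumption).
  pattern <- "^.*[?&]thread_id=([0-9]+).*$"
  dat <- dat[grepl(pattern, dat$url), , drop = FALSE]

  # Extract the digits that follow "thread_id=".
  dat$threadid <- as.integer(sub(pattern, "\\1", dat$url))
  ids <- sort(unique(dat$threadid))

  # Cross-tabulate users against thread ids; table() does the counting.
  counts <- table(factor(dat$user, levels = users),
                  factor(dat$threadid, levels = ids))
  out <- data.frame(user = users, as.data.frame.matrix(counts),
                    check.names = FALSE)
  rownames(out) <- NULL

  # Total views across all thread columns.
  out$totalview <- rowSums(out[, -1, drop = FALSE])
  out
}

# Made-up sample data: one URL has no thread_id, one row is outside the window.
dat <- data.frame(
  user = c(1, 1, 2, 2, 3),
  time = c(1, 2, 3, 4, 50),
  url = c("https://domain.com/forum/thread?thread_id=7",
          "https://domain.com/forum/thread?thread_id=7",
          "https://domain.com/forum/thread?thread_id=12",
          "https://domain.com/about",
          "https://domain.com/forum/thread?thread_id=7"),
  stringsAsFactors = FALSE
)
res <- count_thread_views(dat, t_min = 1, t_max = 10)
print(res)
```

Because table() is vectorized, this may already be fast enough on a fairly large data frame. If it is not, the ddply() call with .parallel = TRUE shown above is still the way to spread the counting over cores.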