You have been provided with the business.RData file from yelp_db. Open it in RStudio. (You
should see the data element imported in the Global Environment window.)
Answer the following questions using R code. Write all code in an R script file and submit it
as an individual assignment.
Q1.
(a) View the dataset to know what is stored in the data element. What is the R code executed in
the command window?
(b) Which data structure is used to store the data? (Is it a vector, matrix, list, or data.frame?)
Find the names of the columns in your data.
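A minimal sketch of the inspection steps in (a) and (b). The small `business` data.frame built here is a hypothetical stand-in so the snippet runs on its own; the real object comes from business.RData and will have different columns and values.

```r
# Hypothetical stand-in for the real `business` object loaded from business.RData.
business <- data.frame(
  name = c("Cafe A", "Diner B", "Bar C"),
  state = c("AZ", "NV", "AZ"),
  stars = c(4.5, 3.0, 4.0),
  review_count = c(120, 15, 48),
  is_open = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# View(business)   # opens the spreadsheet-style viewer (interactive sessions only)
str(business)      # compact summary of every column and its type

structure_class <- class(business)  # the data structure holding the data
column_names <- names(business)     # column names; colnames(business) is equivalent here

print(structure_class)
print(column_names)
```

Clicking the data element in the Global Environment window runs `View()` for you, and the call is echoed in the console.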
(c) The is_open column in the data element represents whether the business is open or not. Find out
how many businesses are open and how many are closed. (Use the table() function to do that.)
table(business$is_open)
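As an illustration of what table() returns, here is a small sketch on made-up values. It assumes that 1 means open and 0 means closed; check that encoding against the real data before relying on it.

```r
# Made-up values; assumed encoding: 1 = open, 0 = closed.
is_open <- c(1, 0, 1, 1, 0, 1)

open_counts <- table(is_open)  # one count per distinct value
print(open_counts)

n_open <- open_counts[["1"]]    # counts are indexed by the value as a string
n_closed <- open_counts[["0"]]

print(prop.table(open_counts))  # the same counts expressed as proportions
```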
(d) How many different attributes are there in the attributes column?
names(business$attributes)
ncol(business$attributes)
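The calls above only work if attributes is itself a data.frame nested inside business. The sketch below builds a hypothetical nested column (the attribute names WiFi and BikeParking are invented for illustration) to show what names() and ncol() report in that case.

```r
# Hypothetical data: a data.frame stored as a single column of another data.frame.
business <- data.frame(name = c("Cafe A", "Diner B"), stringsAsFactors = FALSE)
business$attributes <- data.frame(
  WiFi = c("free", NA),
  BikeParking = c(TRUE, FALSE)
)

attributes_class <- class(business$attributes)  # confirm the column is a data.frame
attribute_names <- names(business$attributes)   # one name per attribute
n_attributes <- ncol(business$attributes)       # number of attributes

print(attributes_class)
print(attribute_names)
print(n_attributes)
```

If class() reports something other than a data.frame (for example a list), ncol() returns NULL and a different approach is needed.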
(e) Find the average stars and average review count for all the businesses.
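A short sketch of the averaging step on invented numbers. Whether the real columns contain missing values is not stated in the assignment, so the `na.rm = TRUE` argument is shown as a precaution rather than a requirement.

```r
# Invented values; the NA illustrates why na.rm can matter.
stars <- c(4.5, 3.0, NA, 4.0)
review_count <- c(120, 15, 48, 7)

avg_stars <- mean(stars, na.rm = TRUE)  # NA is dropped before averaging
avg_reviews <- mean(review_count)       # no missing values here

print(avg_stars)
print(avg_reviews)
```

Without `na.rm = TRUE`, mean() returns NA as soon as one value is missing.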
Q2. We want to subset the data for a specific state.
(a) Use the which() function to find the indices in the “business” data.frame where the state is Arizona
(abbreviated as AZ). This gives you the row numbers that meet the criterion. Now use these row
numbers to subset the businesses in Arizona. Assign the new data.frame to business_AZ.
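The subsetting step can be sketched as follows. The data are a hypothetical stand-in for the real business object; the NA in state is included only to show how which() treats missing values.

```r
# Hypothetical stand-in data with one missing state.
business <- data.frame(
  name = c("Cafe A", "Diner B", "Shop X", "Bar C"),
  state = c("AZ", "NV", NA, "AZ"),
  review_count = c(120, 15, 9, 48),
  stringsAsFactors = FALSE
)

# which() returns the positions where the condition is TRUE and skips NA results.
az_rows <- which(business$state == "AZ")

# Use the row numbers in the row position; the empty column position keeps all columns.
business_AZ <- business[az_rows, ]

print(az_rows)
print(business_AZ)
```

Indexing directly with the logical vector `business$state == "AZ"` would add an all-NA row for the missing state, which is one reason to go through which().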
(b) Calculate the average review count and average stars for businesses in AZ.
(c) Are the average review count and average stars in Arizona higher than the average review
count and average stars calculated from the main data? What can you infer from this finding?
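One way to line up the two sets of averages for (b) and (c) is sketched below on invented data. The numbers are illustrative only and say nothing about what the real comparison will show.

```r
# Invented data; results here do not reflect the real Yelp data.
business <- data.frame(
  state = c("AZ", "NV", "AZ", "OH"),
  stars = c(4.5, 3.0, 4.0, 2.5),
  review_count = c(120, 15, 48, 7),
  stringsAsFactors = FALSE
)
business_AZ <- business[which(business$state == "AZ"), ]

# colMeans() averages several numeric columns in one call.
overall_avg <- colMeans(business[, c("review_count", "stars")])
az_avg <- colMeans(business_AZ[, c("review_count", "stars")])

print(overall_avg)
print(az_avg)
print(az_avg > overall_avg)  # element-wise comparison, one TRUE/FALSE per measure
```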
(d) Check the class of categories. Find the first category listed for the first business. How many
different categories are there in the Arizona business data?
(Hint: You can use unlist() and unique() functions for this task. Check in R help what these two
functions do.)
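unique(business_AZ$categories)
#category_list <- business_AZ$categories
#category_list[1]
# Flatten the per-business category entries into one vector
category <- unlist(business_AZ$categories)
# First element of the flattened vector
category[1]
# Distinct categories, then the number of distinct non-missing categories
unique(category)
sum(!is.na(unique(category)))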
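To make the hint concrete, the sketch below shows what unlist() and unique() do on a small made-up list. It assumes categories is stored as a list with one character vector per business; the real column may be structured differently, which is what checking its class is for.

```r
# Made-up list column: one entry per business, the third business has no category.
categories <- list(
  c("Food", "Coffee & Tea"),
  c("Bars", "Nightlife", "Food"),
  NA_character_
)

first_category <- categories[[1]][1]  # first category of the first business

all_categories <- unlist(categories)         # flatten the list into one vector
distinct_categories <- unique(all_categories)  # drop repeated values (NA is kept once)

# Count distinct categories without counting the missing value.
n_distinct <- sum(!is.na(distinct_categories))

print(first_category)
print(distinct_categories)
print(n_distinct)
```

Calling unique() on the list itself compares whole entries (complete combinations of categories), which is why the flattening step comes first.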
(e) Repeat the same for the main business data. Does Arizona have all categories present in the
main business data? If not, what percentage of the categories is present in Arizona?
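The comparison in (e) amounts to checking set membership. A minimal sketch, using two invented category vectors in place of the distinct categories computed from the main data and from the Arizona subset:

```r
# Invented vectors standing in for the distinct categories of each data set.
all_categories <- c("Food", "Bars", "Nightlife", "Hotels", "Coffee & Tea")
az_categories <- c("Food", "Bars", "Coffee & Tea")

# Categories in the main data that never appear in Arizona.
missing_in_az <- setdiff(all_categories, az_categories)

# %in% gives TRUE/FALSE per category; the mean of a logical vector is the share of TRUE.
pct_present <- 100 * mean(all_categories %in% az_categories)

print(missing_in_az)
print(pct_present)
```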
(f) How many businesses are operating 24 hours in Arizona?
(g) Save the review count for businesses in Arizona in a separate vector. Sort the vector and save
that as sorted review count (use sort() function for that).
Note that this sorted vector will not make sense if we combine with the Arizona business data (any
reason?). The solution is to sort in place.
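The difference between sorting a vector and reordering the rows can be seen on a small made-up example. The data.frame below is hypothetical and only has two columns.

```r
# Hypothetical Arizona subset.
business_AZ <- data.frame(
  name = c("Cafe A", "Bar C", "Deli D"),
  review_count = c(48, 120, 7),
  stringsAsFactors = FALSE
)

review_count_AZ <- business_AZ$review_count
sorted_review_count <- sort(review_count_AZ)  # values only, ascending

# sort() moves the values but not the rows, so pairing it with the
# original names would attach the wrong count to each business.
mismatched <- data.frame(name = business_AZ$name, review_count = sorted_review_count)

# order() returns row positions instead, so whole rows move together.
row_order <- order(business_AZ$review_count)
business_AZ_sorted <- business_AZ[row_order, ]

print(mismatched)
print(business_AZ_sorted)
```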
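(h) Find the top 10 businesses in Arizona based on review count only. Find the top 10 businesses in
Arizona based on both review count and stars. Use the order() function to do that.
# Reorder all rows by review count, largest first
sortedbusinesstable <- business_AZ[order(business_AZ$review_count, decreasing = TRUE), ]
# Row indices in decreasing order of review count
order(-business_AZ$review_count)
# The 10 largest review counts (values only)
head(sort(business_AZ$review_count, decreasing = TRUE), n = 10)
# Row indices of the 10 most-reviewed businesses
head(order(-business_AZ$review_count), n = 10)
# Top 10 businesses by review count
AZanswer <- data.frame(business_AZ[head(order(-business_AZ$review_count), n = 10), ])
AZanswer
# Row indices of the top 10 by review count, with ties broken by stars
head(order(-business_AZ$review_count, -business_AZ$stars), n = 10)
# Top 10 businesses by review count, then stars
AZanswer1 <- data.frame(business_AZ[head(order(-business_AZ$review_count, -business_AZ$stars), n = 10), ])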
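The effect of the second key in order() only shows up when review counts tie. The sketch below uses a hypothetical four-row data.frame with one tie, and takes the top 2 instead of the top 10 to keep the output short.

```r
# Hypothetical data: "Cafe A" and "Bar C" tie on review_count.
business_AZ <- data.frame(
  name = c("Cafe A", "Bar C", "Deli D", "Grill E"),
  review_count = c(120, 120, 7, 48),
  stars = c(3.5, 4.5, 5.0, 4.0),
  stringsAsFactors = FALSE
)

# One key: tied rows stay in their original relative order.
by_reviews <- order(-business_AZ$review_count)

# Two keys: among tied review counts, the higher star rating comes first.
by_reviews_then_stars <- order(-business_AZ$review_count, -business_AZ$stars)

top2_by_reviews <- business_AZ[head(by_reviews, n = 2), ]
top2_by_both <- business_AZ[head(by_reviews_then_stars, n = 2), ]

print(top2_by_reviews)
print(top2_by_both)
```

Negating a column works for numeric keys; for mixed directions or non-numeric keys, the `decreasing` argument of order() is the alternative.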