Opinions vary on the matter of travel. Some people like to travel at every opportunity, some travel occasionally, and others might not like to travel at all. Regardless, the one thing they might all agree on is that travelling can get expensive very fast: the cost of transportation, gas, food, tours, travel kits and so on. One particularly influential factor, however, is the amount of money you spend on lodging. The amount you spend on lodging determines how much you can spend on everything else: the food you can eat, the places you can visit, the experiences you can be part of and, basically, the memories you might make.
Airbnb offers excellent short- to long-term lodging options for travellers. It is cheaper than hotels, and you have the added touch of interacting with a local to get to know the place. However, as Airbnb has taken off, so has the number of options available, and this can really confuse the consumer.
Getting Started
1.1 Required Libraries and APIs
1.2 Scrape the website
1.3 Tidy up the scraped data
1.4 Map and Viz of Airbnb in the US (Exploratory Analysis)
Story of New Jersey
2.1 In which cities are bnbs located (Map and Viz)
2.2 Which cities in New Jersey have the highest reviews
2.3 Desirable cities to Airbnb
2.4 Superhosts in the US and New Jersey
2.5 What kinds of rooms are available in New Jersey
2.6 Price distribution of bnbs in NJ
Machine Learning
3.1 Multiple Linear Regression
3.2 P-value with statsmodels
3.3 Updated Model with Key Features
Conclusion
A more comprehensive list of the environment used is available in requirements.txt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from mapboxgl.utils import *
from mapboxgl.viz import *
import pandasql
from IPython.display import IFrame
import os
import json
from sklearn import model_selection
We wrote a Python scraper (shown below) to get sample data from the Airbnb website. We then left it running overnight on a locally hosted server, gathering 33,000 data points, i.e. Airbnb listings from across the country, on which we would later run the analysis.
Gathering data from a website like Airbnb can be a little challenging because of its restrictions on crawlers, which can result in a lot of 429 and 403 errors. Figuring out how to get the data out of the website was also tricky: we couldn't just target the HTML elements, because that would have involved a lot of bots running automated clicks while a scraper scraped certain parts of the page. The workaround was to capture the network traffic and look for the file that was feeding data to the Airbnb front end. After sorting through all the files on the network we were able to find the one we needed.
The next step was to search the website for all 50 states in the US. The tricky part is that Airbnb uses a special place ID for each state in its URL, so we had to look up that specific ID for every state in order to adjust the URL to our needs. After this was done we just had to make sure we were switching sessions so that it didn't look like we were on their website for too long.
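The session-switching and error handling could be sketched roughly as follows. This is a minimal sketch, not part of the scraper below: `fetch_page` is a hypothetical helper that assumes simple exponential backoff on 429/403 responses.

```python
import random
import time

import requests

def fetch_page(url, max_retries=5):
    """Fetch a URL, backing off exponentially when the site rate-limits us."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        if response.status_code in (429, 403):
            # Rate-limited or blocked: wait longer after each rejection, with jitter.
            time.sleep(2 ** attempt + random.random())
            continue
        response.raise_for_status()
    raise RuntimeError("Gave up on {} after {} retries".format(url, max_retries))
```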
Bingo! We were able to scrape 20 pages for each state.
The Airbnb data was scraped from the following URL: Airbnb.com
from bs4 import BeautifulSoup as bs
import requests
import urllib
import json
import time
import collections
import pprint
import csv
states = {
'Alabama': 'ChIJdf5LHzR_hogR6czIUzU0VV4',
'Alaska': 'ChIJG8CuwJzfAFQRNduKqSde27w',
'Arizona': 'ChIJaxhMy-sIK4cRcc3Bf7EnOUI',
'Arkansas': 'ChIJYSc_dD-e0ocR0NLf_z5pBaQ',
'California': 'ChIJPV4oX_65j4ARVW8IJ6IJUYs',
'Colorado': 'ChIJt1YYm3QUQIcR_6eQSTGDVMc',
'Connecticut': 'ChIJpVER8hFT5okR5XBhBVttmq4',
'Delaware': 'ChIJO9YMTXYFx4kReOgEjBItHZQ',
'Florida': 'ChIJvypWkWV2wYgR0E7HW9MTLvc',
'Georgia': 'ChIJV4FfHcU28YgR5xBP7BC8hGY',
'Idaho': 'ChIJ6Znkhaj_WFMRWIf3FQUwa9A',
'Illinois': 'ChIJGSZubzgtC4gRVlkRZFCCFX8',
'Indiana': 'ChIJHRv42bxQa4gRcuwyy84vEH4',
'Iowa': 'ChIJGWD48W9e7ocR2VnHV0pj78Y',
'Kansas': 'ChIJawF8cXEXo4cRXwk-S6m0wmg',
'Kentucky': 'ChIJyVMZi0xzQogR_N_MxU5vH3c',
'Louisiana': 'ChIJZYIRslSkIIYRA0flgTL3Vck',
'Maine': 'ChIJ1YpTHd4dsEwR0KggZ2_MedY',
'Maryland': 'ChIJ35Dx6etNtokRsfZVdmU3r_I',
'Massachusetts': 'ChIJ_b9z6W1l44kRHA2DVTbQxkU',
'Michigan': 'ChIJEQTKxz2qTE0Rs8liellI3Zc',
'Minnesota': 'ChIJmwt4YJpbWE0RD6L-EJvJogI',
'Mississippi': 'ChIJGdRK5OQyKIYR2qbc6X8XDWI',
'Missouri': 'ChIJfeMiSNXmwIcRcr1mBFnEW7U',
'Montana': 'ChIJ04p7LZwrQVMRGGwqz1jWcfU',
'Nebraska': 'ChIJ7fwMtciNk4cRxArzDwyQJ6E',
'Nevada': 'ChIJcbTe-KEKmYARs5X8qooDR88',
'New Hampshire': 'ChIJ66bAnUtEs0wR64CmJa8CyNc',
'New Jersey': 'ChIJn0AAnpX7wIkRjW0_-Ad70iw',
'New Mexico': 'ChIJqVKY50NQGIcRup41Yxpuv0Y',
'New York': 'ChIJqaUj8fBLzEwRZ5UY3sHGz90',
'North Carolina': 'ChIJgRo4_MQfVIgRGa4i6fUwP60',
'North Dakota': 'ChIJY-nYVxKD11IRyc9egzmahA0',
'Ohio': 'ChIJwY5NtXrpNogRFtmfnDlkzeU',
'Oklahoma': 'ChIJnU-ssRE5rIcRSOoKQDPPHF0',
'Oregon': 'ChIJVWqfm3xuk1QRdrgLettlTH0',
'Pennsylvania': 'ChIJieUyHiaALYgRPbQiUEchRsI',
'Rhode Island': 'ChIJD9cOYhQ15IkR5wbB57wYTh4',
'South Carolina': 'ChIJ49ExeWml-IgRnhcF9TKh_7k',
'South Dakota': 'ChIJpTjphS1DfYcRt6SGMSnW8Ac',
'Tennessee': 'ChIJA8-XniNLYYgRVpGBpcEgPgM',
'Texas': 'ChIJSTKCCzZwQIYRPN4IGI8c6xY',
'Utah': 'ChIJzfkTj8drTIcRP0bXbKVK370',
'Vermont': 'ChIJ_87aSGzctEwRtGtUNnSJTSY',
'Virginia': 'ChIJzbK8vXDWTIgRlaZGt0lBTsA',
'Washington': 'ChIJ-bDD5__lhVQRuvNfbGh4QpQ'
}
data_dict = []
for state in states:
    id = states[state]
    state = state.replace(" ", "%20")
    print(state)
    url = "https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&auto_ib=false&client_session_id=7a1719fb-5452-4be6-8040-45e57eddd9c8&currency=USD&current_tab_id=home_tab&experiences_per_grid=20&fetch_filters=true&guidebooks_per_grid=20&has_zero_guest_treatment=true&hide_dates_and_guests_filters=false&is_guided_search=true&is_new_cards_experiment=true&is_standard_search=true&items_per_grid=18&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&locale=en&metadata_only=false&query={}%2C%20United%20States&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&satori_version=1.1.14&screen_height=969&screen_size=medium&screen_width=1114&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true&timezone_offset=-360&version=1.6.5".format(state)
    # Fetch 20 result pages for this state and dump the raw JSON to disk.
    for i in range(0, 20):
        if i == 0:
            page_url = url
        else:
            # Later pages need the state's place_id and an items_offset.
            page_url = "https://www.airbnb.com/api/v2/explore_tabs?_format=for_explore_search_web&auto_ib=false&client_session_id=7a1719fb-5452-4be6-8040-45e57eddd9c8&currency=USD&current_tab_id=home_tab&experiences_per_grid=20&federated_search_session_id=0360991d-1087-4e46-a106-d5090e86351d&fetch_filters=true&guidebooks_per_grid=20&has_zero_guest_treatment=true&hide_dates_and_guests_filters=false&is_guided_search=true&is_new_cards_experiment=true&is_standard_search=true&items_offset={}&items_per_grid=18&key=d306zoyjsyarp7ifhu67rjxn52tv0t20&last_search_session_id=b58af588-7a00-48a0-b9ff-10d3c4a47a52&locale=en&metadata_only=false&place_id={}&query={}%2C%20United%20States&query_understanding_enabled=true&refinement_paths%5B%5D=%2Fhomes&s_tag=X74Q583S&satori_version=1.1.14&screen_height=969&screen_size=medium&screen_width=1114&search_type=pagination&section_offset=4&selected_tab_id=home_tab&show_groupings=true&supports_for_you_v3=true&timezone_offset=-360&version=1.6.5".format(i * 18, id, state)
        page = urllib.request.urlopen(page_url)
        soup = bs(page, "html.parser")
        with open("{}{}page.json".format(i, state), 'wb') as f:
            f.write(str(soup).encode())
    # Parse the saved pages into a flat list of listing dicts.
    for i in range(0, 20):
        with open('{}{}page.json'.format(i, state), 'r', encoding="utf8") as file:
            data = json.load(file)
        homes = data.get('explore_tabs')[0].get('sections')[0].get('listings')
        for home in homes:
            listing = home.get('listing')
            obj = {
                "state": state,
                "room_id": str(listing.get('id')),
                "name": str(listing.get('name')),
                "city": str(listing.get('city')),
                "person_cap": str(listing.get('person_capacity')),
                "bedrooms": str(listing.get('beds')),
                "bathrooms": str(listing.get('bathrooms')),
                "amenities": str(listing.get('preview_amenities')),
                "reviews": str(listing.get('reviews_count')),
                "prop_type": str(listing.get('room_and_property_type')),
                "guests": str(listing.get('guest_label')),
                "star": str(listing.get('star_rating')),
                "avg_rating": str(listing.get('avg_rating')),
                "min_nights": str(listing.get('min_nights')),
                "max_nights": str(listing.get('max_nights')),
                "price": str(home.get('pricing_quote').get('rate').get('amount')),
                "lat": str(listing.get('lat')),
                "long": str(listing.get('lng')),
                "price_factor": str(home.get('pricing_quote').get('weekly_price_factor')),
                "super_host": str(listing.get('is_superhost'))
            }
            data_dict.append(obj)
# Write every listing out to a single CSV.
with open("sample.csv", "w", encoding='utf-8') as f:
    writer = csv.DictWriter(
        f, fieldnames=["state", "room_id", "name", "city", "person_cap", "bedrooms", "bathrooms",
                       "amenities", "reviews", "prop_type", "guests", "star", "avg_rating", "min_nights",
                       "max_nights", "price", "lat", "long", "price_factor", "super_host"])
    writer.writeheader()
    writer.writerows(data_dict)
After scraping, the only thing left was to make some adjustments to the types of the series in the dataframe so we could later use them in visualizations and mapping. We also made a dictionary mapping each state to a dataframe of the Airbnb listings located in it.
data = pd.read_csv("sample.csv")
data = data.drop('amenities',axis = 1)
data['avg_rating'] = pd.to_numeric(data['avg_rating'], errors='coerce')
states = data['state'].unique().tolist()
gb = data.groupby('state')
lst = [gb.get_group(x) for x in gb.groups]
dictionary = dict(zip(states, lst))
data.head()
We start out by seeing how the states in the US stack up in their avg_rating, so we created a choropleth map of the ratings by state. We found that almost all the states had a rating over 4.4, which was very impressive.
To create the map we needed a GeoJSON file, which we had to build ourselves, so we ran the script below to produce it and then hosted the file on our own website so that it could be accessed from Jupyter.
query = "SELECT avg(avg_rating) as rating, state FROM data GROUP BY state"
sub = pandasql.sqldf(query,globals())
l = sub.values.tolist()
# sta = {}
# for val in l:
# sta[val[1].replace("%20", " ")] = val[0]
# with open('https://320final.github.io/states.geojson') as f:
# dta = json.load(f)
# for feature in dta['features']:
# if feature['properties']['name'] in sta:
# feature['properties']['rating'] = round(sta[feature['properties']['name']],1)
# else:
# feature['properties']['rating'] = 4.4
# with open('states.geojson', 'w') as outfile:
# json.dump(dta, outfile)
token = 'pk.eyJ1IjoiamFwbmVldCIsImEiOiJjazQzM25majcwM3diM21uMXk4bnVieWl6In0.Nw00DmDETPkUH7dtCSzC9Q'
viz = ChoroplethViz('https://320final.github.io/states.geojson',
                    access_token=token,
                    color_property='rating',
                    color_stops=create_color_stops([3.4, 4.6, 4.7, 4.8, 4.9], colors=['#f5fcff', '#dbf3fa', '#b7e9f7', '#92dff3', '#7AD7f0']),
                    opacity=0.8,
                    style='mapbox://styles/mapbox/light-v9?optimize=true',
                    center=(-96, 37.8),
                    zoom=3,
                    label_color='#ececec',
                    below_layer='waterway-label')
viz.show()
First we wanted to see the distribution of Airbnbs in each state by city; simply put, we wanted to see how many Airbnbs existed in each particular city throughout the US.
query = "SELECT city, count(city) as count, lat, long FROM data Group By city"
sub = pandasql.sqldf(query,globals())
display(sub)
l = sub.values.tolist()
df_split = np.array_split(sub, 5)
for i in range(0, 5):
    points = df_to_geojson(df_split[i], properties=['city', 'count'], lat='lat', lon='long', precision=3)
    # Create a clustered circle map for this fifth of the cities
    color_stops = create_color_stops([1, 10, 50, 100], colors=['#f5fcff', '#dbf3fa', '#b7e9f7', '#92dff3'])
    viz = ClusteredCircleViz(points,
                             access_token=token,
                             color_stops=color_stops,
                             style='mapbox://styles/mapbox/light-v9?optimize=true',
                             radius_stops=[[1, 5], [10, 10], [50, 15], [100, 20]],
                             radius_default=2,
                             cluster_maxzoom=10,
                             cluster_radius=30,
                             label_size=8,
                             opacity=0.9,
                             center=(-95, 40),
                             zoom=3)
    viz.show()
Next we wanted to see how Airbnb prices were distributed across the United States. This would help us get an idea of which states are the most expensive to bnb in compared to others.
costs = []
for i in lst:
    # Mean listing price per state
    costs.append(i['price'].mean())
len(costs)
len(states)
plt.figure(figsize = (30, 10))
ax = sns.barplot(x = states, y = costs)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right",fontsize=20)
plt.tight_layout()
plt.show()
token = 'pk.eyJ1IjoiamFwbmVldCIsImEiOiJjazQzM25majcwM3diM21uMXk4bnVieWl6In0.Nw00DmDETPkUH7dtCSzC9Q'
query = "SELECT city, price, lat,long FROM data Group By city Order By price"
sub = pandasql.sqldf(query,globals())
display(sub)
df_split = np.array_split(sub, 5)
for i in range(0, 5):
    points = df_to_geojson(df_split[i],
                           properties=['city', 'price'],
                           lat='lat', lon='long', precision=3)
    # Generate data breaks and color stops from colorBrewer
    color_breaks = [10, 50, 100, 200, 300, 500, 700, 1000, 1500]
    color_stops = create_color_stops(color_breaks, colors='YlGnBu')
    # Create the viz from the dataframe
    viz = CircleViz(points,
                    access_token=token,
                    height='600px',
                    color_property='price',
                    color_stops=color_stops,
                    radius=5,
                    center=(-95, 40),
                    zoom=3,
                    below_layer='waterway-label')
    viz.show()
One of the most important aspects of choosing an Airbnb is the number of reviews. Since we couldn't get the actual review text from the website, we used the review count for each city, and plotted a heatmap to see where reviews were highest and lowest. We found that Bryson City in North Carolina had the fewest reviews, whereas Satellite Beach in Florida had the most, which makes sense since Florida is one of the most visited destinations in the US.
query = "SELECT city, reviews, lat, long FROM data Group By city"
sub = pandasql.sqldf(query,globals())
display(sub)
# Generate data breaks and color stops from colorBrewer
color_breaks = [0,10,50,100,150,200,250,300,350]
heatmap_color_stops = create_color_stops([0.01,0.25,0.5,0.75,1], colors='RdPu')
heatmap_radius_stops = [[0,1], [15, 40]] #increase radius with zoom
color_stops = create_color_stops(color_breaks, colors='Spectral')
heatmap_weight_stops = create_weight_stops(color_breaks)
df_split = np.array_split(sub, 3)
for i in range(0, 3):
    points = df_to_geojson(df_split[i],
                           properties=['city', 'reviews'],
                           lat='lat', lon='long', precision=3)
    viz = HeatmapViz(points,
                     access_token=token,
                     weight_property='reviews',
                     style='mapbox://styles/mapbox/light-v9?optimize=true',
                     weight_stops=heatmap_weight_stops,
                     color_stops=heatmap_color_stops,
                     radius_stops=heatmap_radius_stops,
                     opacity=0.9,
                     center=(-95, 40),
                     zoom=3,
                     below_layer='waterway-label')
    viz.show()
Next we take a deep dive into one state to analyze some other factors relevant to a traveller's Airbnb stay.
We start out by plotting how the Airbnbs are scattered across the state. This gives us a sense of which cities are more likely to receive travellers than others.
query = "SELECT city, count(city) as count, lat, long FROM data WHERE state = 'New%20Jersey' Group By city Order By count"
sub = pandasql.sqldf(query,globals())
display(sub)
points = df_to_geojson(sub,
properties=['city', 'count'],
lat='lat', lon='long', precision=3)
# Generate data breaks and color stops from colorBrewer
color_breaks = [1,2,3,4,5,10,30,40,45]
color_stops = create_color_stops(color_breaks, colors='YlGnBu')
# Create the viz from the dataframe
viz = CircleViz(points,
access_token=token,
height='600px',
color_property='count',
color_stops=color_stops,
label_property='count',
stroke_color='black',
radius = 5,
center=(-75, 40),
zoom=7,
below_layer='waterway-label')
viz.show()
Once we had the cities with Airbnbs, it was time to see which of them had the highest review counts so that we could suggest the perfect place to stay in NJ.
query = "SELECT city, reviews, count(city) as count, lat, long FROM data WHERE state = 'New%20Jersey' Group By city"
sub = pandasql.sqldf(query,globals())
display(sub)
plt.figure(figsize = (40, 10))
ax = sns.barplot(x = sub['city'], y = sub['reviews'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right",fontsize=20)
plt.tight_layout()
plt.show()
Even though New Brunswick has just 5 Airbnbs, it has the highest number of reviews, which makes New Brunswick a pretty lovable place to stay. Compare this to Jersey City, which has 56 reported bnbs but only around 300 reviews, far fewer than New Brunswick.
This depicts which cities have a better ratio of avg_rating to the number of bnbs in the city.
Here we expand on the results stated above. We calculate the ratio of avg_rating to the number of bnbs in each city to see where a traveller would love to stay. We considered anything below a 4 unsatisfactory.
query = "SELECT city, count(city) as count, avg(avg_rating) as rating ,lat, long, (avg_rating/count(city)) as ratio FROM data WHERE state = 'New%20Jersey' Group By city Order By ratio"
sub = pandasql.sqldf(query,globals())
display(sub)
plt.figure(figsize = (40, 10))
ax = sns.barplot(x = sub['city'], y = sub['ratio'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right",fontsize=20)
plt.tight_layout()
plt.show()
Cities like Cherry Hill racked up the number one spot. On researching our results further, we learned that Cherry Hill in fact has the highest crime rate in NJ. Well, you would want to steer clear of those regions then!!
query = "SELECT city, count(city) as count, avg(avg_rating) as rating ,lat, long, (avg_rating/count(city)) as ratio FROM data WHERE state = 'New%20Jersey' Group By city Order By ratio"
sub = pandasql.sqldf(query,globals())
query = "SELECT city, ratio, lat, long FROM sub where ratio < 4"
sub = pandasql.sqldf(query,globals())
display(sub)
points = df_to_geojson(sub,
properties=['city', 'ratio'],
lat='lat', lon='long', precision=3)
# Generate data breaks and color stops from colorBrewer
color_breaks = [0,0.5,1,1.5,2,2.5]
color_stops = create_color_stops(color_breaks, colors='YlGnBu')
# Create the viz from the dataframe
viz = CircleViz(points,
access_token=token,
height='600px',
color_property='ratio',
color_stops=color_stops,
label_property='ratio',
radius = 5,
center=(-75, 41),
zoom=7,
below_layer='waterway-label')
viz.show()
Superhosts are hosts with very high ratings who treat their customers the best and have been recognized by Airbnb as its best hosts. This factor is very important when choosing an Airbnb, because you don't want to get stuck with one of those snobby and rude hosts!!
query = "SELECT count(CASE WHEN super_host THEN 1 END) as host_true, count(CASE WHEN not super_host THEN 1 END) as host_false FROM data"
sub = pandasql.sqldf(query,globals())
l = sub.values.tolist()
display(sub)
plt.bar('host_true',sub['host_true'])
plt.bar('host_false', sub['host_false'])
plt.show()
Here we see that based on our sample about 40% of the hosts are superhosts.
query = "SELECT city, count(CASE WHEN super_host THEN 1 END) as host_true, count(CASE WHEN not super_host THEN 1 END) as host_false, lat, long FROM data WHERE state = 'New%20Jersey' Group By city Order By host_true"
sub = pandasql.sqldf(query,globals())
l = sub.values.tolist()
display(sub)
plt.bar('host_true',sub['host_true'])
plt.bar('host_false', sub['host_false'])
plt.show()
points = df_to_geojson(sub,
properties=['city', 'host_true'],
lat='lat', lon='long', precision=3)
# Generate data breaks and color stops from colorBrewer
color_breaks = [0,5,10,15,20]
color_stops = create_color_stops(color_breaks, colors='YlGnBu')
# Create the viz from the dataframe
viz = CircleViz(points,
access_token=token,
height='600px',
color_property='host_true',
color_stops=color_stops,
label_property='host_true',
radius = 5,
center=(-75, 41),
zoom=7,
below_layer='waterway-label')
viz.show()
The above result was surprising, as we see a huge decline in the number of superhosts compared to the result we got for the US as a whole. This means most of the hosts in NJ are not superhosts. Again we see that New Brunswick had all 5 hosts as superhosts, which again makes sense based on our analysis of reviews in the previous cells.
We wanted to see how cities in NJ stack up when it comes to maximum occupancy, because sometimes you are travelling with your family and need a bigger place to stay! And if that's the case, then East Brunswick is your pick.
query = "SELECT city, count(person_cap) as count, avg(avg_rating) as rating ,lat, long, person_cap FROM data WHERE state = 'New%20Jersey' Group By city"
sub = pandasql.sqldf(query,globals())
display(sub)
plt.figure(figsize = (40, 10))
ax = sns.barplot(x = sub['city'], y = sub['person_cap'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right",fontsize=20)
plt.tight_layout()
plt.show()
query = "SELECT count(prop_type) as count, lat,long, prop_type FROM data WHERE state = 'New%20Jersey' Group By prop_type"
sub = pandasql.sqldf(query,globals())
display(sub)
plt.figure(figsize = (40, 50))
ax = sns.barplot(y = sub['prop_type'], x = sub['count'])
ax.set_yticklabels(ax.get_yticklabels(), rotation=40, ha="right",fontsize=30)
plt.tight_layout()
plt.show()
Finally, the important part: how price is distributed in New Jersey.
query = "SELECT city, price, lat,long FROM data WHERE state = 'New%20Jersey' Group By city Order By price"
sub = pandasql.sqldf(query,globals())
display(sub)
points = df_to_geojson(sub,
properties=['city', 'price'],
lat='lat', lon='long', precision=3)
# Generate data breaks and color stops from colorBrewer
color_breaks = [10,50,100,200,300,400,500,600,800]
color_stops = create_color_stops(color_breaks, colors='YlGnBu')
# Create the viz from the dataframe
viz = CircleViz(points,
access_token=token,
height='600px',
color_property='price',
color_stops=color_stops,
radius = 5,
center=(-75, 40.2),
zoom=7,
below_layer='waterway-label')
viz.show()
From the above we infer that the areas around Atlantic City are really expensive, which makes sense given the nightlife and all the casinos.
We continue analyzing the state of New Jersey. We first convert series like price, bedrooms, etc. from object to float so that we can use them with sklearn and Linear Regression. We then check whether there are any NaN or infinite values in the jersey dataframe. Seeing that there are some, we simply drop the rows containing NaN, as there are only 3 of them, leaving us with a sample size of 357.
import warnings
warnings.filterwarnings('ignore')
jersey = dictionary['New%20Jersey']
jersey['price'] = jersey['price'].astype(np.int64)
jersey['bedrooms'] = jersey['bedrooms'].astype(np.float64)
jersey['reviews'] = jersey['reviews'].astype(np.float64)
jersey['bathrooms'] = jersey['bathrooms'].astype(np.float64)
jersey['person_cap'] = jersey['person_cap'].astype(np.float64)
# print(jersey.dtypes)
print(np.any(np.isnan(jersey['price'])))      # Should be False
print(np.all(np.isfinite(jersey['price'])))   # Should be True
print(np.any(np.isnan(jersey['avg_rating'])))
print(np.all(np.isfinite(jersey['avg_rating'])))
print(len(jersey))
jersey.dropna(how='any', inplace=True)
print(len(jersey))
jersey = jersey.reset_index(drop=True)
For the purpose of this analysis we only consider numerical features, as we aim to see which aspects outside of an Airbnb's location influence its cost.
The features (independent variables) we consider are listed in the code below.
The target (dependent variable) we wish to predict from the rest is the price.
features = jersey.loc[:, ['person_cap','bedrooms','bathrooms','reviews','avg_rating','min_nights','max_nights']]
target = jersey['price']
Define X and y for use in scikit-learn's LinearRegression() function. Fit the model.
See this awesome tutorial by Towards Data Science for more information on multiple linear regression.
X = features
y = target
lm = LinearRegression()
model = lm.fit(X,y)
lm.score(X,y)
We get a score of 0.5956, or approximately 60%, meaning the model explains about 60% of the variance, a little more than half.
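Note that this score is computed on the same data the model was fit on, so it can be optimistic. A held-out check, using the `model_selection` module we imported earlier, could be sketched as follows, on synthetic stand-in data since this step was not part of the original analysis:

```python
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 357 rows and 7 numeric features, like the jersey data.
rng = np.random.RandomState(0)
X = rng.rand(357, 7)
y = X @ rng.rand(7) * 100 + rng.randn(357) * 5

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0)

lm = LinearRegression().fit(X_train, y_train)
print(lm.score(X_train, y_train))  # in-sample R^2
print(lm.score(X_test, y_test))    # held-out R^2
```

If the held-out score is far below the in-sample one, the model is overfitting.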
sk_coeffs = lm.coef_.tolist()
for attr, coef in zip(features, sk_coeffs):
print("Attribute: {}, Coefficient: {}".format(attr,coef))
From the above, the most influential attributes appear to be avg_rating and the number of bathrooms available, followed by person_cap, the number of guests allowed. Surprisingly, the duration of stay (min and max nights) and reviews have small or even negative coefficient values.
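One caveat to comparing raw coefficients like this: the features are on very different scales (max_nights can be in the hundreds while avg_rating sits between roughly 3 and 5), and a small coefficient on a wide-ranging feature can matter more than a large coefficient on a narrow one. A sketch of the idea on synthetic data (not the jersey data), comparing raw against standardized coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
nights = rng.uniform(1, 1000, 200)   # wide-ranging feature
rating = rng.uniform(3, 5, 200)      # narrow-ranging feature
X = np.column_stack([nights, rating])
y = 0.1 * nights + 10 * rating       # nights drives most of the variation

raw = LinearRegression().fit(X, y).coef_
std = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_
print(raw)  # rating's raw coefficient looks ~100x bigger
print(std)  # after standardizing, nights clearly dominates
```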
To further see which attributes have a meaningful impact, we need to observe their p-values.
We reuse the features and target created previously with the statsmodels API. After adding a constant, we use Ordinary Least Squares (OLS) to estimate our parameters and see how they hold up.
Our null hypothesis is that none of the attributes has a noticeable impact on the price of an Airbnb.
See this Wikipedia article to learn more about the methodology used in this statsmodels approach.
sm_y = y
sm_X = sm.add_constant(X)  # add an intercept term
OLS_model = sm.OLS(sm_y, sm_X).fit()
OLS_model.summary()
According to the coefficients ('coef') column, the number of bathrooms has by far the biggest impact on the price of the Airbnb. Other attributes with a significant impact on price are the number of guests allowed (person_cap), the number of bedrooms and the minimum duration of stay; the ones with the least impact are the maximum duration of stay, reviews and average rating. While most of this matches the output of the Linear Regression model, we observe that two attributes, average rating and minimum nights, seem to have changed.
From the p-values above ('P>|t|') we can see that, at a significance level of alpha = 0.05, every attribute has a p-value higher than 0.05 except the number of bathrooms and the number of reviews.
Hence, we can conclude that the attributes mainly affecting the price are the number of reviews and the number of bathrooms.
Another thing to note is our R-squared value of 0.809, which is higher than the one obtained from our Linear Regression model.
We once again fit a Linear Regression model, this time with just the number of bathrooms and the average rating.
features = jersey.loc[:, ['bathrooms','avg_rating']]
target = jersey['price']
X = features
y = target
lm = LinearRegression()
model = lm.fit(X,y)
lm.score(X,y)
Since we get an extremely similar score, we can safely conclude that the two factors influencing the price of an Airbnb, outside of its location, are the number of bathrooms and its average rating, one of which is definitely a surprise.
Since booking lodging is a big hassle, we have given an overview of some factors one might consider while booking an Airbnb in the place they want to travel. Airbnb has in recent years become a popular alternative to hotels and motels for tourists. We focused on features such as prices, superhosts, reviews and the area in which an Airbnb is located. From these findings, we concluded that a lot goes into deciding where to stay, but if we focus on these main criteria, finding good lodging won't be a hassle: for example, avoid areas with a poor review/count ratio and try to find a superhost. Prices vary according to one's choice of location, but if you want a cheap, cozy place, avoid downtown and instead choose a semi-suburban area like New Brunswick in New Jersey.
However, there are some limitations to our analysis of Airbnb data. Mainly, we were only able to analyze it at the current point in time, and prices may vary from season to season, as is common in the hotel industry. Additionally, we only used numerical data for our machine learning analysis; using string data like the city area would likely have improved our results.
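Folding in a categorical column like the city, as suggested, could be sketched with one-hot encoding (the rows below are hypothetical stand-ins for the jersey dataframe):

```python
import pandas as pd

# Hypothetical slice of the jersey dataframe.
df = pd.DataFrame({
    'city': ['Newark', 'Jersey City', 'Newark', 'New Brunswick'],
    'bathrooms': [1.0, 2.0, 1.5, 1.0],
    'price': [80, 150, 95, 120],
})

# One indicator column per city; these could join the numeric features above.
encoded = pd.get_dummies(df, columns=['city'], prefix='city')
print(encoded.columns.tolist())
```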
Nonetheless, by referencing this analysis, travellers and consumers might be better able to make use of Airbnb and make their stay more enjoyable and pocket-friendly.