Hypotheses:
Cities analyzed: Amsterdam, Athens, Barcelona, Berlin, Budapest, Lisbon, London, Paris, Rome, Vienna
Datasets are available for each city for weekends and weekdays, with the following columns:
realSum - the total price of the Airbnb listing;
room_type - private/shared/entire home/apt;
room_shared - whether the room is shared or not;
room_private - whether the room is private or not;
person_capacity - the maximum number of people that can stay in the room;
host_is_superhost - boolean value indicating if host is a superhost or not;
multi - indicator whether listing is for multiple rooms or not;
biz - indicator whether listing is for business purposes or not;
cleanliness_rating - the cleanliness rating of the listing;
guest_satisfaction_overall - overall rating from guests comparing all listings offered by host;
bedrooms - the number of bedrooms in the listing;
dist - distance from city center;
metro_dist - the distance from the nearest metro station;
lng & lat - coordinates for location identification.
import os
import glob
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()
import numpy as np
from scipy import stats
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
At this stage the downloaded Airbnb datasets for each city are loaded (10 CSV files with weekday data and 10 CSV files with weekend data). A "for" loop merges the separate files into one; during the loop, new "City" and "Weekday" columns are created from the file names. Unneeded columns are then dropped, data types are converted, and a new "price_per_person" column is created.
# Change the working directory to the folder where the data is stored
os.chdir("C:\\Users\\egecaite.BAIPGROUP\\Desktop\\DataEra\\Baigiamasis\\Data")
# File extension to look for
extension = 'csv'
# List the names of all files with that extension
all_filenames = glob.glob('*.{}'.format(extension))
# Empty list for collecting the DataFrames
dfs = []
# Loop over the CSV files, read each one into a DataFrame
# and append it to the dfs list
for file in all_filenames:
    # Read the CSV file into a DataFrame
    df = pd.read_csv(file)
    # Take "City" and "Weekday" from the CSV file name
    city = os.path.basename(file).split('_')[0]
    weekday = os.path.basename(file).split('_')[1].split('.')[0]
    # Add the new "City" and "Weekday" columns
    df['City'] = city
    df['Weekday'] = weekday
    # Append the DataFrame to the dfs list
    dfs.append(df)
# Concatenate all DataFrames in dfs into a single DataFrame
airbnb = pd.concat(dfs, ignore_index=True)
# Drop the columns that are not needed
airbnb = airbnb.drop(columns=['attr_index','attr_index_norm','rest_index','rest_index_norm'])
airbnb.drop('Unnamed: 0', axis=1, inplace=True)
# Capitalize the city names
def capitalize_column(df, column):
    df[column] = df[column].str.title()
    return df
airbnb = capitalize_column(airbnb, 'City')
# Convert the 'multi' and 'biz' columns to boolean
airbnb['multi'] = airbnb['multi'].astype(bool)
airbnb['biz'] = airbnb['biz'].astype(bool)
# Rename columns to more descriptive names
airbnb.rename(columns={'realSum':'price', 'dist' : 'citycentre_dist', 'multi' : 'multiple_room',
                       'biz' : 'business_room', 'City' : 'city', 'Weekday' : 'weekday'}, inplace=True)
# Create the new 'price_per_person' column
airbnb['price_per_person'] = airbnb['price'] / airbnb['person_capacity']
# Save the cleaned dataset
#airbnb.to_csv('airbnb_full.csv', index=False, encoding='utf-8-sig')
airbnb.head(5)
| | price | room_type | room_shared | room_private | person_capacity | host_is_superhost | multiple_room | business_room | cleanliness_rating | guest_satisfaction_overall | bedrooms | citycentre_dist | metro_dist | lng | lat | city | weekday | price_per_person |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 194.033698 | Private room | False | True | 2.0 | False | True | False | 10.0 | 93.0 | 1 | 5.022964 | 2.539380 | 4.90569 | 52.41772 | Amsterdam | weekdays | 97.016849 |
1 | 344.245776 | Private room | False | True | 4.0 | False | False | False | 8.0 | 85.0 | 1 | 0.488389 | 0.239404 | 4.90005 | 52.37432 | Amsterdam | weekdays | 86.061444 |
2 | 264.101422 | Private room | False | True | 2.0 | False | False | True | 9.0 | 87.0 | 1 | 5.748312 | 3.651621 | 4.97512 | 52.36103 | Amsterdam | weekdays | 132.050711 |
3 | 433.529398 | Private room | False | True | 4.0 | False | False | True | 9.0 | 90.0 | 2 | 0.384862 | 0.439876 | 4.89417 | 52.37663 | Amsterdam | weekdays | 108.382349 |
4 | 485.552926 | Private room | False | True | 2.0 | True | False | False | 10.0 | 98.0 | 1 | 0.544738 | 0.318693 | 4.90051 | 52.37508 | Amsterdam | weekdays | 242.776463 |
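The file-name parsing used in the loading loop can be checked in isolation. The sketch below assumes the files follow a `<city>_<weekday>.csv` naming pattern, as the `split('_')` calls suggest; the helper `parse_city_weekday` and the example file names are hypothetical and are not part of the notebook.

```python
import os

def parse_city_weekday(path):
    """Split '<city>_<weekday>.csv' into (capitalized city, weekday).

    Hypothetical helper; assumes exactly the naming pattern above.
    """
    stem, _ext = os.path.splitext(os.path.basename(path))
    city, weekday = stem.split('_', 1)
    return city.title(), weekday

# Made-up file names that follow the assumed pattern
examples = ['amsterdam_weekdays.csv', 'data/rome_weekends.csv']
parsed = [parse_city_weekday(name) for name in examples]
print(parsed)
```

Using `os.path.splitext` instead of a second `split('.')` keeps the parsing correct even if a directory name in the path contains a dot.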
# Number of rows and columns
airbnb.shape
(51707, 18)
# Non-null counts and data types per column
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51707 entries, 0 to 51706
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   price                       51707 non-null  float64
 1   room_type                   51707 non-null  object
 2   room_shared                 51707 non-null  bool
 3   room_private                51707 non-null  bool
 4   person_capacity             51707 non-null  float64
 5   host_is_superhost           51707 non-null  bool
 6   multiple_room               51707 non-null  bool
 7   business_room               51707 non-null  bool
 8   cleanliness_rating          51707 non-null  float64
 9   guest_satisfaction_overall  51707 non-null  float64
 10  bedrooms                    51707 non-null  int64
 11  citycentre_dist             51707 non-null  float64
 12  metro_dist                  51707 non-null  float64
 13  lng                         51707 non-null  float64
 14  lat                         51707 non-null  float64
 15  city                        51707 non-null  object
 16  weekday                     51707 non-null  object
 17  price_per_person            51707 non-null  float64
dtypes: bool(5), float64(9), int64(1), object(3)
memory usage: 5.4+ MB
# Unique cities
airbnb['city'].unique()
array(['Amsterdam', 'Athens', 'Barcelona', 'Berlin', 'Budapest', 'Lisbon', 'London', 'Paris', 'Rome', 'Vienna'], dtype=object)
# Unique room types
airbnb['room_type'].unique()
array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)
# Quick summary statistics for the numeric columns
airbnb.describe()
| | price | person_capacity | cleanliness_rating | guest_satisfaction_overall | bedrooms | citycentre_dist | metro_dist | lng | lat | price_per_person |
---|---|---|---|---|---|---|---|---|---|---|
count | 51707.000000 | 51707.000000 | 51707.000000 | 51707.000000 | 51707.00000 | 51707.000000 | 51707.000000 | 51707.000000 | 51707.000000 | 51707.000000 |
mean | 279.879591 | 3.161661 | 9.390624 | 92.628232 | 1.15876 | 3.191285 | 0.681540 | 7.426068 | 45.671128 | 95.038708 |
std | 327.948386 | 1.298545 | 0.954868 | 8.945531 | 0.62741 | 2.393803 | 0.858023 | 9.799725 | 5.249263 | 121.129949 |
min | 34.779339 | 2.000000 | 2.000000 | 20.000000 | 0.00000 | 0.015045 | 0.002301 | -9.226340 | 37.953000 | 8.851498 |
25% | 148.752174 | 2.000000 | 9.000000 | 90.000000 | 1.00000 | 1.453142 | 0.248480 | -0.072500 | 41.399510 | 51.672921 |
50% | 211.343089 | 3.000000 | 10.000000 | 95.000000 | 1.00000 | 2.613538 | 0.413269 | 4.873000 | 47.506690 | 75.290339 |
75% | 319.694287 | 4.000000 | 10.000000 | 99.000000 | 1.00000 | 4.263077 | 0.737840 | 13.518825 | 51.471885 | 114.622850 |
max | 18545.450285 | 6.000000 | 10.000000 | 100.000000 | 10.00000 | 25.284557 | 14.273577 | 23.786020 | 52.641410 | 9272.725142 |
# Summary of the text columns
airbnb.describe(include='object')
| | room_type | city | weekday |
---|---|---|---|
count | 51707 | 51707 | 51707 |
unique | 3 | 10 | 2 |
top | Entire home/apt | London | weekends |
freq | 32648 | 9993 | 26207 |
# Number of Airbnb listings per city
airbnb.groupby('city')['price'].count().sort_values(ascending=False)
city
London       9993
Rome         9027
Paris        6688
Lisbon       5763
Athens       5280
Budapest     4022
Vienna       3537
Barcelona    2833
Berlin       2484
Amsterdam    2080
Name: price, dtype: int64
# Number of Airbnb listings per city on weekdays and weekends
# (pivot arguments are passed by keyword, which newer pandas versions require)
airbnb.groupby(['city', 'weekday']).agg({'room_type' : 'count'}).reset_index().pivot(
    index='city', columns='weekday', values='room_type').sort_values(by=['weekdays', 'weekends'], ascending=[False, False]).rename(
    columns={'weekdays' : 'quantity_in_weekdays', 'weekends' : 'quantity_in_weekends'})
city | quantity_in_weekdays | quantity_in_weekends |
---|---|---|
London | 4614 | 5379 |
Rome | 4492 | 4535 |
Paris | 3130 | 3558 |
Lisbon | 2857 | 2906 |
Athens | 2653 | 2627 |
Budapest | 2074 | 1948 |
Vienna | 1738 | 1799 |
Barcelona | 1555 | 1278 |
Berlin | 1284 | 1200 |
Amsterdam | 1103 | 977 |
airbnb.sort_values(by='price', ascending = False)
| | price | room_type | room_shared | room_private | person_capacity | host_is_superhost | multiple_room | business_room | cleanliness_rating | guest_satisfaction_overall | bedrooms | citycentre_dist | metro_dist | lng | lat | city | weekday | price_per_person |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3590 | 18545.450285 | Entire home/apt | False | False | 2.0 | True | False | True | 10.0 | 100.0 | 1 | 1.196536 | 0.381128 | 23.73200 | 37.98600 | Athens | weekdays | 9272.725142 |
34803 | 16445.614689 | Entire home/apt | False | False | 2.0 | False | False | False | 9.0 | 100.0 | 1 | 4.602378 | 0.118665 | 2.29772 | 48.83669 | Paris | weekdays | 8222.807345 |
24348 | 15499.894165 | Entire home/apt | False | False | 3.0 | True | False | True | 10.0 | 95.0 | 3 | 0.269101 | 0.227193 | -0.13038 | 51.50995 | London | weekdays | 5166.631388 |
48380 | 13664.305916 | Private room | False | True | 2.0 | False | False | False | 9.0 | 87.0 | 1 | 2.239501 | 0.414395 | 16.34356 | 48.20751 | Vienna | weekdays | 6832.152958 |
50787 | 13656.358834 | Private room | False | True | 2.0 | False | False | False | 9.0 | 87.0 | 1 | 2.239486 | 0.414409 | 16.34356 | 48.20751 | Vienna | weekends | 6828.179417 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5316 | 42.884259 | Private room | False | True | 3.0 | False | False | True | 10.0 | 90.0 | 1 | 0.953017 | 0.370606 | 23.72700 | 37.98100 | Athens | weekends | 14.294753 |
15917 | 40.184236 | Private room | False | True | 3.0 | False | True | False | 9.0 | 90.0 | 3 | 8.014306 | 4.193543 | 19.13895 | 47.54219 | Budapest | weekends | 13.394745 |
13954 | 39.009259 | Private room | False | True | 3.0 | False | True | False | 9.0 | 90.0 | 3 | 8.014301 | 4.193548 | 19.13895 | 47.54219 | Budapest | weekdays | 13.003086 |
13884 | 37.129295 | Entire home/apt | False | False | 2.0 | False | False | False | 10.0 | 93.0 | 2 | 4.644683 | 0.557410 | 19.11517 | 47.50491 | Budapest | weekdays | 18.564647 |
15563 | 34.779339 | Private room | False | True | 2.0 | False | True | False | 10.0 | 97.0 | 1 | 9.986018 | 7.847268 | 18.96347 | 47.56406 | Budapest | weekends | 17.389670 |
51707 rows × 18 columns
# Most expensive Airbnb listings (marker size scales with price per person)
fig = px.scatter_mapbox(airbnb, lat='lat', lon='lng', hover_name='price_per_person', hover_data=['room_type', 'person_capacity'],
                        color_discrete_sequence=['red'], zoom=3, size='price_per_person', size_max=20, width=900)
fig.update_layout(mapbox_style='open-street-map')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
print('Highest Airbnb prices')
pyo.iplot(fig, filename='Highest Airbnb prices')
Highest Airbnb prices
# Average Airbnb price by city on weekdays and weekends
airbnb_vid_kaina = airbnb.groupby(['city', 'weekday']).agg({'price' : 'mean'}).reset_index().pivot(
    index='city', columns='weekday', values='price').sort_values(by=['weekdays', 'weekends'], ascending = [False, False]).reset_index()
airbnb_vid_kaina.plot(x= 'city', y= ['weekdays', 'weekends'], title= 'Average Airbnb price', kind= 'bar', figsize=(
    10, 2),color=['gold', 'skyblue'])
<AxesSubplot:title={'center':'Average Airbnb price'}, xlabel='city'>
# Average Airbnb price per person by city on weekdays and weekends
airbnb_vid_kaina = airbnb.groupby(['city', 'weekday']).agg({'price_per_person' : 'mean'}).reset_index().pivot(
    index='city', columns='weekday', values='price_per_person').sort_values(by=['weekdays', 'weekends'], ascending = [False, False]).reset_index()
airbnb_vid_kaina.plot(x= 'city', y= ['weekdays', 'weekends'], title= 'Average Airbnb price per person', kind= 'bar', figsize=(
    10, 2),color=['salmon', 'lightgreen'])
<AxesSubplot:title={'center':'Average Airbnb price per person'}, xlabel='city'>
airbnb_vid_kaina
weekday | city | weekdays | weekends |
---|---|---|---|
0 | Amsterdam | 194.624966 | 218.386276 |
1 | Paris | 140.356389 | 134.989334 |
2 | London | 126.515587 | 126.805257 |
3 | Barcelona | 104.601522 | 121.310802 |
4 | Berlin | 90.197117 | 94.640790 |
5 | Vienna | 79.038560 | 78.595960 |
6 | Lisbon | 74.172944 | 76.356491 |
7 | Rome | 64.218883 | 66.477167 |
8 | Budapest | 50.651854 | 56.691099 |
9 | Athens | 46.538078 | 42.694725 |
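To make the weekday/weekend comparison easier to read, the relative weekend premium can be computed from the means in the table above. This is only a sketch: the numbers are copied from that table (price per person, not total price), and the helper `weekend_premium_pct` is a hypothetical name introduced here.

```python
# Mean price per person (weekdays, weekends), copied from the table above
mean_ppp = {
    'Amsterdam': (194.624966, 218.386276),
    'Paris': (140.356389, 134.989334),
    'London': (126.515587, 126.805257),
    'Barcelona': (104.601522, 121.310802),
    'Berlin': (90.197117, 94.640790),
    'Vienna': (79.038560, 78.595960),
    'Lisbon': (74.172944, 76.356491),
    'Rome': (64.218883, 66.477167),
    'Budapest': (50.651854, 56.691099),
    'Athens': (46.538078, 42.694725),
}

def weekend_premium_pct(weekdays, weekends):
    """Relative difference of the weekend mean versus the weekday mean, in percent."""
    return (weekends - weekdays) / weekdays * 100

premium = {city: weekend_premium_pct(*means) for city, means in mean_ppp.items()}

# Print cities from the largest weekend premium to the largest weekend discount
for city, pct in sorted(premium.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{city:10s} {pct:+.1f}%')
```

A relative difference is used rather than an absolute one so that expensive and cheap cities can be compared on the same scale.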
# Airbnb price per person by city on weekdays and weekends - boxplot
plt.figure(figsize=(12, 3))
ax = plt.subplot()
plt.axis([0,8,0,400])
sns.set_theme(style='ticks', palette='pastel')
sns.boxplot(x='city', y='price_per_person', hue='weekday', palette=['salmon', 'darkgrey'],
data=airbnb, fliersize=0.5, linewidth=1, order=airbnb.groupby(
'city')['price_per_person'].median().sort_values(ascending=False).index)
plt.ylabel('Airbnb_price_per_person')
plt.grid(axis='y', color='lightgrey', linestyle='--', linewidth=.5)
plt.legend(loc=1)
plt.show()
# Cleanliness rating
airbnb.groupby('city')['cleanliness_rating'].mean().sort_values(ascending=False)
city
Athens       9.638447
Rome         9.514678
Budapest     9.477374
Vienna       9.472434
Amsterdam    9.465865
Berlin       9.461755
Lisbon       9.370640
Barcelona    9.291564
Paris        9.263606
London       9.175023
Name: cleanliness_rating, dtype: float64
# Overall rating
airbnb.groupby('city')['guest_satisfaction_overall'].mean().sort_values(ascending=False)
city
Athens       95.003598
Budapest     94.585281
Amsterdam    94.514423
Berlin       94.323671
Vienna       93.731128
Rome         93.122300
Paris        92.037530
Barcelona    91.109072
Lisbon       91.093875
London       90.645652
Name: guest_satisfaction_overall, dtype: float64
airbnb.loc[:, ['cleanliness_rating', 'guest_satisfaction_overall']].describe()
| | cleanliness_rating | guest_satisfaction_overall |
---|---|---|
count | 51707.000000 | 51707.000000 |
mean | 9.390624 | 92.628232 |
std | 0.954868 | 8.945531 |
min | 2.000000 | 20.000000 |
25% | 9.000000 | 90.000000 |
50% | 10.000000 | 95.000000 |
75% | 10.000000 | 99.000000 |
max | 10.000000 | 100.000000 |
# Relationship between the cleanliness rating and overall Airbnb guest satisfaction
sns.regplot(data= airbnb, x= 'cleanliness_rating', y= 'guest_satisfaction_overall',
scatter_kws={'color': 'darkkhaki'}, line_kws={'color': 'dimgrey'})
<AxesSubplot:xlabel='cleanliness_rating', ylabel='guest_satisfaction_overall'>
# Strength of the correlation
corr, p_value = pearsonr(airbnb['cleanliness_rating'], airbnb['guest_satisfaction_overall'])
print('Correlation coefficient:', corr)
print('p-value:', p_value)
Correlation coefficient: 0.7140450220820529
p-value: 0.0
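For reference, the Pearson coefficient reported above is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up ratings (not values from the dataset); `pearson_r` is a hypothetical helper, not the `scipy.stats.pearsonr` implementation. The reported p-value of 0.0 is presumably a value too small to represent as a float rather than an exact zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up example ratings, for illustration only
cleanliness = [10, 9, 8, 10, 7, 9]
satisfaction = [98, 92, 85, 95, 70, 90]
r = pearson_r(cleanliness, satisfaction)
print('r =', round(r, 3))
```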
Rome was chosen for building the Airbnb pricing model: it is one of the largest cities by number of Airbnb listings (second after London) and stands out for its more concentrated Airbnb prices (a lower standard deviation).
airbnb_Rome = airbnb[(airbnb['city'] == 'Rome')]
airbnb_Rome.sort_values(by='price_per_person', ascending = False)
| | price | room_type | room_shared | room_private | person_capacity | host_is_superhost | multiple_room | business_room | cleanliness_rating | guest_satisfaction_overall | bedrooms | citycentre_dist | metro_dist | lng | lat | city | weekday | price_per_person |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47009 | 2311.738714 | Private room | False | True | 2.0 | False | False | False | 10.0 | 100.0 | 1 | 1.731672 | 1.054644 | 12.50391 | 41.91641 | Rome | weekends | 1155.869357 |
42518 | 2305.192528 | Private room | False | True | 2.0 | False | False | False | 10.0 | 100.0 | 1 | 1.731677 | 1.054641 | 12.50391 | 41.91641 | Rome | weekdays | 1152.596264 |
45898 | 1384.752063 | Entire home/apt | False | False | 2.0 | False | False | False | 9.0 | 90.0 | 1 | 2.678288 | 0.229584 | 12.51353 | 41.92346 | Rome | weekends | 692.376032 |
41398 | 1380.777593 | Entire home/apt | False | False | 2.0 | False | False | False | 9.0 | 90.0 | 1 | 2.678285 | 0.229584 | 12.51353 | 41.92346 | Rome | weekdays | 690.388797 |
40900 | 2418.348023 | Entire home/apt | False | False | 4.0 | False | False | False | 4.0 | 60.0 | 2 | 0.584549 | 0.377582 | 12.50100 | 41.90605 | Rome | weekdays | 604.587006 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40742 | 71.540458 | Private room | False | True | 4.0 | False | False | True | 7.0 | 78.0 | 1 | 3.709584 | 1.360518 | 12.47000 | 41.92400 | Rome | weekdays | 17.885114 |
40660 | 71.540458 | Private room | False | True | 4.0 | False | False | True | 7.0 | 80.0 | 1 | 3.787374 | 1.463280 | 12.47000 | 41.92500 | Rome | weekdays | 17.885114 |
43805 | 103.803801 | Shared room | True | False | 6.0 | False | True | False | 8.0 | 72.0 | 1 | 2.273166 | 0.241203 | 12.52224 | 41.91486 | Rome | weekends | 17.300634 |
39311 | 103.803801 | Shared room | True | False | 6.0 | False | True | False | 8.0 | 72.0 | 1 | 2.273175 | 0.241213 | 12.52224 | 41.91486 | Rome | weekdays | 17.300634 |
45466 | 102.634840 | Entire home/apt | False | False | 6.0 | False | False | False | 8.0 | 90.0 | 3 | 5.797237 | 3.082330 | 12.44594 | 41.87000 | Rome | weekends | 17.105807 |
9027 rows × 18 columns
airbnb_Rome.describe()
| | price | person_capacity | cleanliness_rating | guest_satisfaction_overall | bedrooms | citycentre_dist | metro_dist | lng | lat | price_per_person |
---|---|---|---|---|---|---|---|---|---|---|
count | 9027.000000 | 9027.000000 | 9027.000000 | 9027.000000 | 9027.000000 | 9027.000000 | 9027.000000 | 9027.000000 | 9027.000000 | 9027.000000 |
mean | 205.391950 | 3.357372 | 9.514678 | 93.122300 | 1.229755 | 3.026982 | 0.819794 | 12.486139 | 41.895372 | 65.353404 |
std | 118.618103 | 1.309052 | 0.808415 | 7.815107 | 0.549710 | 1.644095 | 0.631361 | 0.028827 | 0.017964 | 37.352868 |
min | 46.057092 | 2.000000 | 2.000000 | 20.000000 | 0.000000 | 0.042789 | 0.011093 | 12.400790 | 41.818000 | 17.105807 |
25% | 138.405069 | 2.000000 | 9.000000 | 91.000000 | 1.000000 | 1.880467 | 0.325294 | 12.467430 | 41.884000 | 45.005027 |
50% | 182.591822 | 3.000000 | 10.000000 | 95.000000 | 1.000000 | 2.815721 | 0.621587 | 12.480000 | 41.897300 | 57.793468 |
75% | 240.806116 | 4.000000 | 10.000000 | 98.000000 | 1.000000 | 4.030506 | 1.220111 | 12.505560 | 41.907190 | 76.742337 |
max | 2418.348023 | 6.000000 | 10.000000 | 100.000000 | 5.000000 | 9.553819 | 4.147201 | 12.582980 | 41.951780 | 1155.869357 |
# 6 subplots
fig, axs = plt.subplots(2,3, figsize=(12, 4))
# 1
sns.regplot(ax=axs[0,0], data=airbnb_Rome, x='person_capacity', y='price_per_person',
scatter_kws={'color': 'cadetblue'}, line_kws={'color': 'dimgrey'})
axs[0, 0].set_xlim([0, 8])
axs[0, 0].set_ylim([0, 400])
# 2
sns.regplot(ax=axs[0,1], data=airbnb_Rome, x='bedrooms', y='price_per_person',
scatter_kws={'color': 'salmon'}, line_kws={'color': 'dimgrey'})
axs[0, 1].set_xlim([-1, 6])
axs[0, 1].set_ylim([0, 400])
# 3
sns.regplot(ax=axs[0,2], data=airbnb_Rome, x='citycentre_dist', y='price_per_person',
scatter_kws={'color': 'orange'}, line_kws={'color': 'dimgrey'})
axs[0, 2].set_xlim([0, 10])
axs[0, 2].set_ylim([0, 400])
# 4
sns.regplot(ax=axs[1,0], data=airbnb_Rome, x='metro_dist', y='price_per_person',
scatter_kws={'color': 'skyblue'}, line_kws={'color': 'dimgrey'})
axs[1, 0].set_xlim([0, 5])
axs[1, 0].set_ylim([0, 400])
# 5
sns.regplot(ax=axs[1,1], data=airbnb_Rome, x='guest_satisfaction_overall', y='price_per_person',
scatter_kws={'color': 'lightgreen'}, line_kws={'color': 'dimgrey'})
axs[1, 1].set_xlim([0, 100])
axs[1, 1].set_ylim([0, 400])
# 6
sns.regplot(ax=axs[1,2], data=airbnb_Rome, x='cleanliness_rating', y='price_per_person',
scatter_kws={'color': 'darkkhaki'}, line_kws={'color': 'dimgrey'})
axs[1, 2].set_xlim([0, 11])
axs[1, 2].set_ylim([0, 400])
# display the plot
plt.tight_layout()
plt.show()
corr, p_value = pearsonr(airbnb_Rome['person_capacity'], airbnb_Rome['price_per_person'])
print('Correlation coefficient person_capacity:', corr)
print('p-value:', p_value)
corr, p_value = pearsonr(airbnb_Rome['bedrooms'], airbnb_Rome['price_per_person'])
print('Correlation coefficient bedrooms:', corr)
print('p-value:', p_value)
corr, p_value = pearsonr(airbnb_Rome['citycentre_dist'], airbnb_Rome['price_per_person'])
print('Correlation coefficient citycentre_dist:', corr)
print('p-value:', p_value)
corr, p_value = pearsonr(airbnb_Rome['metro_dist'], airbnb_Rome['price_per_person'])
print('Correlation coefficient metro_dist:', corr)
print('p-value:', p_value)
corr, p_value = pearsonr(airbnb_Rome['guest_satisfaction_overall'], airbnb_Rome['price_per_person'])
print('Correlation coefficient guest_satisfaction_overall:', corr)
print('p-value:', p_value)
corr, p_value = pearsonr(airbnb_Rome['cleanliness_rating'], airbnb_Rome['price_per_person'])
print('Correlation coefficient cleanliness_rating:', corr)
print('p-value:', p_value)
Correlation coefficient person_capacity: -0.2868346363176606
p-value: 1.615772175093204e-170
Correlation coefficient bedrooms: -0.09064087331827733
p-value: 6.2210989072449664e-18
Correlation coefficient citycentre_dist: -0.184556414772082
p-value: 5.522114882146471e-70
Correlation coefficient metro_dist: 0.006324911120130413
p-value: 0.5479359538123383
Correlation coefficient guest_satisfaction_overall: 0.03326440792561239
p-value: 0.0015727929521942591
Correlation coefficient cleanliness_rating: 0.02574376340563548
p-value: 0.014445295353994761
selected_attributes= ['person_capacity', 'bedrooms', 'citycentre_dist', 'metro_dist','guest_satisfaction_overall',
'cleanliness_rating' ]
X = airbnb_Rome[selected_attributes]
X.head(2)
| | person_capacity | bedrooms | citycentre_dist | metro_dist | guest_satisfaction_overall | cleanliness_rating |
---|---|---|---|---|---|---|
39143 | 2.0 | 1 | 2.978468 | 1.595733 | 95.0 | 10.0 |
39144 | 2.0 | 1 | 0.935371 | 0.649269 | 80.0 | 9.0 |
y= airbnb_Rome['price_per_person']
y.head(2)
39143    78.437332
39144    86.386272
Name: price_per_person, dtype: float64
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LinearRegression()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print(model.score(X_test, y_test))
0.11247942648800247
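The number printed by `model.score` for a `LinearRegression` is the coefficient of determination R², not a classification-style accuracy. A minimal sketch of that quantity, on made-up prices and with a hypothetical helper name `r_squared`:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up prices per person, for illustration only
y_true = [50.0, 60.0, 70.0, 80.0]
perfect = r_squared(y_true, y_true)                 # predictions equal the truth
baseline = r_squared(y_true, [65.0] * len(y_true))  # always predict the mean
print(perfect, baseline)
```

Read this way, a score of about 0.11 means the model explains roughly 11% of the variance in price per person on the test split, only slightly better than always predicting the mean.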
airbnb_Rome_for_outliers = airbnb_Rome.loc[:, ['price_per_person', 'person_capacity', 'bedrooms', 'citycentre_dist',
'metro_dist', 'guest_satisfaction_overall', 'cleanliness_rating']]
z_scores = stats.zscore(airbnb_Rome_for_outliers)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
airbnb_Rome_excl_outliers = airbnb_Rome_for_outliers[filtered_entries]
airbnb_Rome_excl_outliers.shape
(8315, 7)
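The filter above keeps a row only if every selected column lies within 3 standard deviations of that column's mean, which is why 712 of the 9027 Rome rows are dropped. The sketch below reproduces the rule in plain Python on made-up rows; `zscores` and `keep_within` are hypothetical helpers, and the population standard deviation is used to match the default `ddof=0` of `scipy.stats.zscore`.

```python
import math

def zscores(values):
    """Z-scores using the population standard deviation (ddof=0)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def keep_within(rows, threshold=3.0):
    """Keep rows whose z-score is below the threshold in every column."""
    z_cols = [zscores(col) for col in zip(*rows)]
    return [row for i, row in enumerate(rows)
            if all(abs(z_col[i]) < threshold for z_col in z_cols)]

# Made-up (price_per_person, person_capacity) rows: 19 ordinary listings and one extreme price
rows = [(50 + i, 2 + i % 3) for i in range(19)] + [(1000, 2)]
kept = keep_within(rows)
print(len(rows), '->', len(kept))
```

The sketch assumes every column has a non-zero standard deviation; a constant column would cause a division by zero.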
airbnb_Rome_excl_outliers.describe()
| | price_per_person | person_capacity | bedrooms | citycentre_dist | metro_dist | guest_satisfaction_overall | cleanliness_rating |
---|---|---|---|---|---|---|---|
count | 8315.000000 | 8315.000000 | 8315.000000 | 8315.000000 | 8315.000000 | 8315.000000 | 8315.000000 |
mean | 63.476935 | 3.308358 | 1.176909 | 2.957820 | 0.780007 | 93.784606 | 9.588455 |
std | 25.493340 | 1.271312 | 0.457052 | 1.569335 | 0.568147 | 5.669632 | 0.590366 |
min | 17.300634 | 2.000000 | 0.000000 | 0.042789 | 0.011093 | 70.000000 | 8.000000 |
25% | 45.355715 | 2.000000 | 1.000000 | 1.854963 | 0.320053 | 91.000000 | 9.000000 |
50% | 57.824640 | 3.000000 | 1.000000 | 2.788918 | 0.603644 | 95.000000 | 10.000000 |
75% | 76.099409 | 4.000000 | 1.000000 | 3.961438 | 1.187668 | 98.000000 | 10.000000 |
max | 177.214598 | 6.000000 | 2.000000 | 7.909770 | 2.708675 | 100.000000 | 10.000000 |
X = airbnb_Rome_excl_outliers[selected_attributes]
X.head(2)
| | person_capacity | bedrooms | citycentre_dist | metro_dist | guest_satisfaction_overall | cleanliness_rating |
---|---|---|---|---|---|---|
39143 | 2.0 | 1 | 2.978468 | 1.595733 | 95.0 | 10.0 |
39144 | 2.0 | 1 | 0.935371 | 0.649269 | 80.0 | 9.0 |
y= airbnb_Rome_excl_outliers['price_per_person']
y.head(2)
39143    78.437332
39144    86.386272
Name: price_per_person, dtype: float64
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LinearRegression()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print(model.score(X_test, y_test))
0.2512915118846336
1) In most of the European cities analyzed, the average Airbnb price on weekends was higher than on weekdays. The largest differences were in Amsterdam, Barcelona and Budapest. In Paris and Athens, the average Airbnb price on weekends was lower than on weekdays.
2) A strong relationship was found between the Airbnb cleanliness rating and overall guest satisfaction (Pearson r ≈ 0.71): the better the cleanliness rating, the higher the overall guest satisfaction.
3) The accuracy of the Airbnb pricing model is fairly low (R² ≈ 0.25 on the test set), even after the outliers were removed.
4) It is likely that the prices in the available data refer to Airbnb stays of different durations. We suggest finding Airbnb data in which prices are given for the same rental duration; this would likely help improve the model's accuracy.
import mysql.connector
import pandas as pd
mydb = mysql.connector.connect(
host = "localhost",
port = '3306',
user = 'root',
password = 'xxx'
)
sakila = pd.read_sql('SELECT rating, SUM(length) FROM sakila.film GROUP BY rating', con=mydb)
sakila
C:\Users\egecaite.BAIPGROUP\Anaconda3\lib\site-packages\pandas\io\sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
| | rating | SUM(length) |
---|---|---|
0 | PG | 21729.0 |
1 | G | 19767.0 |
2 | NC-17 | 23778.0 |
3 | PG-13 | 26859.0 |
4 | R | 23139.0 |
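The same aggregation pattern can be tried without a MySQL server. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; the `film` table and its four rows are made up for illustration and are not the Sakila data.

```python
import sqlite3

# In-memory database with a tiny made-up 'film' table (not the Sakila data)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE film (title TEXT, rating TEXT, length INTEGER)')
conn.executemany(
    'INSERT INTO film VALUES (?, ?, ?)',
    [('FILM A', 'PG', 86), ('FILM B', 'G', 48), ('FILM C', 'G', 117), ('FILM D', 'R', 100)],
)

# Total length per rating, as in the GROUP BY query above
totals = dict(conn.execute('SELECT rating, SUM(length) FROM film GROUP BY rating').fetchall())

# Films rated G or PG, with the values passed as query parameters
g_pg = conn.execute(
    'SELECT title FROM film WHERE rating IN (?, ?) ORDER BY title', ('G', 'PG')
).fetchall()
conn.close()

print(totals)
print(g_pg)
```

Passing the rating values as parameters avoids quoting issues inside the SQL string.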
sakila = pd.read_sql("SELECT * FROM sakila.film WHERE rating IN ('G', 'PG')", con=mydb)
sakila
C:\Users\egecaite.BAIPGROUP\Anaconda3\lib\site-packages\pandas\io\sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
| | film_id | title | description | release_year | language_id | original_language_id | rental_duration | rental_rate | length | replacement_cost | rating | special_features | last_update |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ACADEMY DINOSAUR | A Epic Drama of a Feminist And a Mad Scientist... | 2006 | 1 | None | 6 | 0.99 | 86 | 20.99 | PG | {Deleted Scenes, Behind the Scenes} | 2006-02-15 05:03:42 |
1 | 2 | ACE GOLDFINGER | A Astounding Epistle of a Database Administrat... | 2006 | 1 | None | 3 | 4.99 | 48 | 12.99 | G | {Deleted Scenes, Trailers} | 2006-02-15 05:03:42 |
2 | 4 | AFFAIR PREJUDICE | A Fanciful Documentary of a Frisbee And a Lumb... | 2006 | 1 | None | 5 | 2.99 | 117 | 26.99 | G | {Behind the Scenes, Commentaries} | 2006-02-15 05:03:42 |
3 | 5 | AFRICAN EGG | A Fast-Paced Documentary of a Pastry Chef And ... | 2006 | 1 | None | 6 | 2.99 | 130 | 22.99 | G | {Deleted Scenes} | 2006-02-15 05:03:42 |
4 | 6 | AGENT TRUMAN | A Intrepid Panorama of a Robot And a Boy who m... | 2006 | 1 | None | 3 | 2.99 | 169 | 17.99 | PG | {Deleted Scenes} | 2006-02-15 05:03:42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
367 | 983 | WON DARES | A Unbelieveable Documentary of a Teacher And a... | 2006 | 1 | None | 7 | 2.99 | 105 | 18.99 | PG | {Behind the Scenes} | 2006-02-15 05:03:42 |
368 | 985 | WONDERLAND CHRISTMAS | A Awe-Inspiring Character Study of a Waitress ... | 2006 | 1 | None | 4 | 4.99 | 111 | 19.99 | PG | {Commentaries} | 2006-02-15 05:03:42 |
369 | 987 | WORDS HUNTER | A Action-Packed Reflection of a Composer And a... | 2006 | 1 | None | 3 | 2.99 | 116 | 13.99 | PG | {Deleted Scenes, Trailers, Commentaries} | 2006-02-15 05:03:42 |
370 | 991 | WORST BANGER | A Thrilling Drama of a Madman And a Dentist wh... | 2006 | 1 | None | 4 | 2.99 | 185 | 26.99 | PG | {Deleted Scenes, Behind the Scenes} | 2006-02-15 05:03:42 |
371 | 996 | YOUNG LANGUAGE | A Unbelieveable Yarn of a Boat And a Database ... | 2006 | 1 | None | 6 | 0.99 | 183 | 9.99 | G | {Trailers, Behind the Scenes} | 2006-02-15 05:03:42 |
372 rows × 13 columns