This notebook serves two main purposes:
We will try to present some common techniques for representing data; moreover, we will try to show how different types of plots or data manipulation can make a plot more interpretable.
Important: This is the short and brief version of the main EDA notebook; it doesn't show in detail how the data was processed or what was done with it. Consider it a dashboard. To learn about the code and the exploratory analysis, check the more verbose version.
This notebook is available in the GitHub repo for corrections, additions and edits. All pull requests are welcome.
import sys
from io import StringIO
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from wordcloud import WordCloud
from collections import Counter
%matplotlib inline
plt.style.use('seaborn')
from eda_utils import Eda
%load_ext autoreload
%autoreload 2
eda = Eda()
eda.df.head()
eda.plot_date_count()
eda.plot_top_countries()
The distribution of countries among the audience may be explained by the popularity of Telegram in those countries. Tests of this hypothesis are left as an exercise for the reader (please submit a PR if you actually do something :) ).
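One way to start on that exercise is a rank correlation between the number of respondents per country and some measure of Telegram usage there. The sketch below is only an illustration: the helper name `rank_correlation` is hypothetical, the usage figures would have to come from an external source, and the numbers in the example are placeholders, not real data.

```python
import pandas as pd


def rank_correlation(respondents: pd.Series, telegram_users: pd.Series) -> float:
    """Spearman rank correlation between two per-country series.

    Both series are assumed to be indexed by country name; only countries
    present in both are compared.
    """
    # Align on the shared countries and drop the rest.
    joined = pd.concat(
        [respondents.rename("respondents"), telegram_users.rename("telegram")],
        axis=1,
        join="inner",
    )
    # Spearman correlation is the Pearson correlation of the ranks.
    ranks = joined.rank()
    return ranks["respondents"].corr(ranks["telegram"])


# Placeholder numbers for illustration only, not survey or usage data.
respondents = pd.Series({"A": 50, "B": 30, "C": 20, "D": 5})
telegram_users = pd.Series({"A": 900, "B": 400, "C": 450, "D": 100, "E": 70})
rho = rank_correlation(respondents, telegram_users)
```

With only a handful of countries such a coefficient is weak evidence at best, so it should be read as a sanity check rather than a proof.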
eda.plot_feature_count('Work', 'Work status of the audience')
Most people either work at companies or are studying to become data scientists.
eda.plot_work_country()
So we can conclude that the distribution is generally quite uniform across countries, except for India, where the share of students is slightly greater than in the other top 5 countries.
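A comparison like this boils down to the share of each work status within each country. Here is a minimal sketch of such a table, assuming a `Country` column next to the `Work` column used earlier; the real column names inside `eda.df` may differ, and the helper name `work_share_by_country` is hypothetical.

```python
import pandas as pd


def work_share_by_country(df: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Share of each work status within the `top_n` most frequent countries."""
    # Keep only the countries with the most respondents.
    top = df["Country"].value_counts().head(top_n).index
    subset = df[df["Country"].isin(top)]
    # normalize="index" makes every country row sum to 1, so countries with
    # different numbers of respondents become comparable.
    return pd.crosstab(subset["Country"], subset["Work"], normalize="index")


# Toy rows for illustration only, not survey data.
toy = pd.DataFrame(
    {
        "Country": ["X", "X", "X", "Y", "Y", "Z"],
        "Work": ["Student", "Employed", "Employed", "Student", "Student", "Employed"],
    }
)
shares = work_share_by_country(toy, top_n=2)
```

Normalizing per country matters here: raw counts would mostly reflect how many respondents each country has, not how their work status differs.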
eda.plot_age()
The audience is quite young.
eda.plot_feature_count('Education', 'Education status of the audience')
eda.plot_feature_count('Experience', 'Experience status of the audience')
eda.plot_age_experience()
One of the main reasons behind the survey is to understand whether the audience is satisfied with the complexity of the material. To get a better picture, we can (again) use a violin plot.
eda.plot_satistaction()
Let's do the same for the update frequency.
eda.plot_feature_count('Sat_update', 'Distribution of satisfaction for update frequency')
eda.plot_feature_count('Recommend', 'How likely are you to recommend the channel to a friend?')
Given people's interests, we create a word cloud to see which topics are trending.
eda.display_wordcloud_image('Interests')
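A word cloud is driven by term frequencies, so it can help to look at the counts directly. The sketch below assumes each `Interests` answer is a single comma-separated string; that format and the helper name `count_terms` are assumptions, not something taken from `eda_utils`.

```python
from collections import Counter


def count_terms(answers, sep=","):
    """Count how many answers mention each term."""
    counts = Counter()
    for answer in answers:
        # Skip missing answers (None, NaN or other non-string values).
        if not isinstance(answer, str):
            continue
        # Normalise case and whitespace so "NLP " and "nlp" are one term;
        # the set ensures a respondent counts at most once per term.
        terms = {term.strip().lower() for term in answer.split(sep)}
        terms.discard("")
        counts.update(terms)
    return counts


# Toy answers for illustration only, not survey data.
toy_answers = ["NLP, Computer Vision", "nlp", None, "Computer Vision, NLP, "]
frequencies = count_terms(toy_answers)
```

A mapping like `frequencies` can be passed to `WordCloud().generate_from_frequencies(...)` to draw the image from the same counts.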
We could also check whether particular trends occur in different countries. We could draw a word cloud for each country, or we could select the top-3 topics of each country. Let's try the harder way, the latter.
eda.plot_countries_interests()
From this plot we can see that deep learning and image processing are very popular across countries, while in India people ask for beginner material. This may be explained by the fact that, as we saw in 03 and 04, India has the highest percentage of students and of people aged 18 to 24.
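One possible way to compute a top-3 selection per country is to count interests within each country and keep the most frequent ones. This is a sketch under stated assumptions: the `(country, interests)` input shape, the comma separator, and the helper name `top_topics_by_country` are hypothetical and not taken from `eda_utils`.

```python
from collections import Counter, defaultdict


def top_topics_by_country(rows, n=3, sep=","):
    """Return the `n` most frequent topics for every country.

    `rows` is an iterable of (country, interests) pairs, where `interests`
    is one delimited string per respondent.
    """
    per_country = defaultdict(Counter)
    for country, interests in rows:
        # Skip respondents without a usable country or answer.
        if not isinstance(country, str) or not isinstance(interests, str):
            continue
        topics = {t.strip().lower() for t in interests.split(sep)}
        topics.discard("")
        per_country[country].update(topics)
    # most_common(n) sorts by count, highest first.
    return {
        country: [topic for topic, _ in counts.most_common(n)]
        for country, counts in per_country.items()
    }


# Toy rows for illustration only, not survey data.
toy_rows = [
    ("X", "deep learning, nlp"),
    ("X", "deep learning"),
    ("Y", "beginner tutorials"),
    ("Y", "beginner tutorials, deep learning, nlp, statistics"),
]
top = top_topics_by_country(toy_rows, n=2)
```

Topics with equal counts come back in no particular order, so a tie at the cut-off should be inspected before reading much into the ranking.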
eda.display_wordcloud_image('How_found')