OpenDataScience channel audience stats¶

This notebook serves two main purposes:

To describe the audience statistics of the Open Data Science Telegram channel.
To show how Exploratory Data Analysis (EDA) can be performed.

We will try to present some common techniques for representing data; moreover, we will try to show how different types of plots or data manipulation can make a plot more interpretable.

Important: This is the short, brief version of the main EDA notebook; it does not show in detail what was done with the data or how. Consider it a dashboard. To learn about the code and the exploratory analysis, check the more verbose version.

This notebook is available at the GitHub repo for corrections, additions and edits. All pull requests are welcome.

Imports¶

import sys
from io import StringIO
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from wordcloud import WordCloud
from collections import Counter

%matplotlib inline
plt.style.use('seaborn')
from eda_utils import Eda
%load_ext autoreload
%autoreload 2


01 Data preparation and general information¶

eda = Eda()
eda.df.head()

eda.plot_date_count()

eda.plot_top_countries()

The distribution of countries among the audience may be explained by the popularity of Telegram in the corresponding countries. Tests to verify this hypothesis are left as an exercise for the reader (please submit a PR if you actually do something :) ).
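The internals of `plot_top_countries` are not shown in this short version. As a rough sketch only, the counting step behind such a plot could look like the following. The toy frame and the helper name `top_countries` are assumptions made for illustration; only the `Country` column name comes from the survey data.

```python
import pandas as pd

# Made-up rows standing in for the survey's 'Country' column (not real answers).
df = pd.DataFrame(
    {"Country": ["Country A", "Country B", "Country A", "Country C", "Country A", "Country B"]}
)


def top_countries(frame, n=5):
    """Hypothetical helper: the n most frequent countries with their share of answers."""
    counts = frame["Country"].value_counts()  # sorted in descending order, NaN dropped
    top = counts.head(n).to_frame("count")
    top["share"] = top["count"] / counts.sum()
    return top


top = top_countries(df, n=2)
print(top)
```

Reporting the share next to the raw count makes the ranking easier to compare with external figures, such as Telegram usage per country, if someone takes up the exercise above.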

02 Work status¶

eda.plot_feature_count('Work', 'Work status of the audience')

Most people either work in companies or study to become data scientists.

03 Countries and Work Status together¶

eda.plot_work_country()

So we can conclude that the distribution is generally quite uniform across countries, except for India, where the number of students is slightly greater than in the other top 5 countries.
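A comparison like this is easier to read when each country is normalised to shares, so that countries with many respondents do not dominate. The snippet below is a minimal sketch of that idea using `pandas.crosstab`; it is not the code behind `plot_work_country`, and the rows are invented placeholders.

```python
import pandas as pd

# Invented placeholder rows; only the column names match the survey data.
df = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", "C", "C"],
        "Work": ["Student", "Student", "Employed", "Employed", "Employed", "Student", "Employed"],
    }
)

# normalize="index" divides every row by its total, so each country sums to 1.
shares = pd.crosstab(df["Country"], df["Work"], normalize="index")
print(shares.round(2))
```

With row-normalised shares, a country whose student share stands out can be spotted directly, regardless of how many answers it contributed.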

04 Age¶

eda.plot_age()

The audience is quite young.

05 Education¶

eda.plot_feature_count('Education', 'Education status of the audience')

06 Experience¶

eda.plot_feature_count('Experience', 'Experience status of the audience')

eda.plot_age_experience()


07 Is the audience satisfied with the material and update frequency?¶

One of the main reasons behind the survey is to understand whether the audience is satisfied with the complexity of the material. To get a better picture, we can (again) use a violin plot.

eda.plot_satistaction()

Let's do the same for the update frequency.

eda.plot_feature_count('Sat_update', 'Distribution of satisfaction for update frequency')

08 Recommend chance¶

eda.plot_feature_count('Recommend', 'How likely that you are going to recommend a channel to a friend?')

09 Interests¶

Given people's interests, we create a WordCloud to see which topics are trending.

eda.display_wordcloud_image('Interests')
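How `display_wordcloud_image` prepares its input is not shown here. Judging from the `Interests` values in the data preview, each answer holds groups of hashtags separated by `;`, so one plausible preprocessing step is to count how many respondents mention each tag. The sketch below assumes that format; the answers and the function name `count_hashtags` are made up for illustration.

```python
from collections import Counter

# Made-up answers that imitate the format of the 'Interests' column.
answers = [
    "#CV #DL #imageprocessing;#RL #DL",
    "#RL #DL;#NLP #NLU",
    "#CV #DL #imageprocessing",
    None,  # a respondent who skipped the question
]


def count_hashtags(values):
    """Hypothetical helper: number of respondents mentioning each hashtag."""
    counter = Counter()
    for value in values:
        if not isinstance(value, str):  # skips None and NaN
            continue
        # A set ensures a tag repeated inside one answer is counted once.
        tags = {t for t in value.replace(";", " ").split() if t.startswith("#")}
        counter.update(tags)
    return counter


counts = count_hashtags(answers)
print(counts.most_common(3))
```

A frequency mapping like `counts` can be passed to `WordCloud.generate_from_frequencies`, which avoids re-tokenising the raw text.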

We could also check whether particular trends occur in different countries. We could build a WordCloud for each country, or we could select the top-3 topics per country. Let's try the harder way, the latter.

eda.plot_countries_interests()

From this plot we can see that DeepLearning and image processing are very popular across countries, while in India people ask for beginner material. This can be explained by the fact that, as we saw in 03 and 04, India has the highest percentage of students and of people aged 18 to 24.
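Selecting the top-3 topics per country can be sketched with `explode` and a grouped `head`. This is an illustrative assumption about the approach, not the implementation of `plot_countries_interests`; the rows are placeholders, and a tag repeated within a single answer would be counted more than once in this simplified version.

```python
import pandas as pd

# Placeholder rows; only the column names match the survey data.
df = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B", "B"],
        "Interests": [
            "#DL #CV;#NLP",
            "#DL",
            "#EntryLevel #MOOC",
            "#EntryLevel;#DL",
            "#EntryLevel #MOOC",
        ],
    }
)

# One row per (respondent, tag): unify separators, split on whitespace, explode.
tags = (
    df.assign(tag=df["Interests"].str.replace(";", " ", regex=False).str.split())
    .explode("tag")
    .dropna(subset=["tag"])
)

# Count tags per country, then keep the three most frequent in each country.
counts = tags.groupby(["Country", "tag"]).size().rename("n").reset_index()
top3 = (
    counts.sort_values(["Country", "n", "tag"], ascending=[True, False, True])
    .groupby("Country")
    .head(3)
)
print(top3)
```

Sorting by tag name as the last key makes ties deterministic, which keeps the ranking stable between runs.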

10 Source¶

eda.display_wordcloud_image('How_found')

	Timestamp	Country	Timezone	Education	Work	Experience	Age	Sat_update	Sat_material	Interests	How_found	Recommend	Why	If you want to reach for the editors and to write something, please use the field below:
0	2020-01-29 13:30:57-03:00	Ukraine	GMT+3	Undergrad	Student + part time remote job	Middle	18-24	Yes, it's about perfect	It's all ok	#CV #DL #imageprocessing #videolearning;#RL #D...	Forward from a friend	5	All stuff is absolutely brilliant! Thank you f...	NaN
1	2020-01-29 13:31:19-03:00	Russia	GMT+3	Graduate	Employed	Middle	31-42	Nope, less frequent posting will be all right ...	It's all ok	#RL #DL;#NLP #NLU #conversational #dialoguesys...	It's been so long time ago, I can't remember (...	4	it's ok	post some jobs with salary ranges, especially ...
2	2020-01-29 13:32:48-03:00	Ukraine	GMT+2	PhD	Unemployed	Novice (Studying courses, active learning)	25-30	Yes, it's about perfect	Need more specific and complicated materials	#WhereToStart #EntryLevel #Novice #MOOC #Learn...	Telegram channel search	3	NaN	NaN
3	2020-01-29 13:33:27-03:00	Italy	GMT+1	No degree at all, still learning / self-taught	Student	Novice (Studying courses, active learning)	18-24	Yes, it's about perfect	It's all ok	#CV #DL #imageprocessing #videolearning;#RL #D...	Forward from a friend	5	Mainly due to material shared	NaN
4	2020-01-29 13:33:49-03:00	Ukraine	GMT+2	Graduate	Employed	Middle	18-24	Yes, it's about perfect	It's all ok	#RL #DL;#NLP #NLU #conversational #dialoguesys...	It's been so long time ago, I can't remember (...	2	It's not super useful actually. Good enough to...	NaN