Python: Project - Movie Dataset Analysis Code (39)

Exploring the Movie Dataset

In this project, you will apply what you have learned, using functions from the NumPy, Pandas, matplotlib, and seaborn libraries to explore a movie dataset.

Download the dataset:
TMDb movie data

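Meaning of each column in the dataset:

Column name             Meaning
id                      ID
imdb_id                 IMDb ID
popularity              Popularity
budget                  Budget
revenue                 Box office revenue
original_title          Title
cast                    Lead cast
homepage                Website
director                Director
tagline                 Tagline
keywords                Keywords
overview                Overview
runtime                 Runtime
genres                  Genres
production_companies    Production companies
release_date            Release date
vote_count              Vote count
vote_average            Vote average
release_year            Release year
budget_adj              Budget (adjusted)
revenue_adj             Box office revenue (adjusted)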

Note that you need to submit the .html, .ipynb, and .py files exported from this report.



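Section 1: Importing and Preprocessing the Data

In this part, you will write code that reads the data with Pandas and preprocesses it.

Task 1.1: Import the libraries and the data

  1. Load the required libraries: NumPy, Pandas, matplotlib, and seaborn.
  2. Use Pandas to read the data in tmdb-movies.csv and save it as movie_data.

Hint: remember to use the notebook magic command %matplotlib inline; otherwise the plots you create later will not be displayed.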

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

# Use the notebook magic command %matplotlib inline; otherwise later plots will not be displayed
%matplotlib inline

# Read the data
movie_data = pd.read_csv('./data/tmdb-movies.csv')

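Task 1.2: Get to know the data

You will encounter all kinds of data tables, so after reading one in, it is worth using a few simple methods to see what the table looks like.

  1. Get the number of rows and columns of the table and print them.
  2. Use the .head(), .tail(), and .sample() methods to inspect the table.
  3. Use the .dtypes attribute to check the data type of each column.
  4. Use isnull() together with .any() and similar methods to check whether each column contains null values.
  5. Use the .describe() method to see how the numeric data in the table is distributed.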
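Step 1 of the list (printing the table's dimensions) can be sketched as follows. This is a minimal illustration on a small hypothetical table named toy, not on the real movie_data; it also counts missing values per column with isnull().sum(), which complements the isnull().any() check.

```python
import pandas as pd

# Hypothetical toy table standing in for movie_data (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100, 0, 250, 80],
    "director": ["X", None, "Y", "Z"],
})

# .shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# .isnull().any() only says whether a column has nulls;
# .isnull().sum() says how many.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the real table the same two calls would be movie_data.shape and movie_data.isnull().sum().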
# Use .head()
movie_data.head(5)
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

# print tail
movie_data.tail(5)
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
10861 21 tt0060371 0.080598 0 0 The Endless Summer Michael Hynson|Robert August|Lord 'Tally Ho' B... NaN Bruce Brown NaN ... The Endless Summer, by Bruce Brown, is one of ... 95 Documentary Bruce Brown Films 6/15/66 11 7.4 1966 0.000000 0.0
10862 20379 tt0060472 0.065543 0 0 Grand Prix James Garner|Eva Marie Saint|Yves Montand|Tosh... NaN John Frankenheimer Cinerama sweeps YOU into a drama of speed and ... ... Grand Prix driver Pete Aron is fired by his te... 176 Action|Adventure|Drama Cherokee Productions|Joel Productions|Douglas ... 12/21/66 20 5.7 1966 0.000000 0.0
10863 39768 tt0060161 0.065141 0 0 Beregis Avtomobilya Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z... NaN Eldar Ryazanov NaN ... An insurance agent who moonlights as a carthie... 94 Mystery|Comedy Mosfilm 1/1/66 11 6.5 1966 0.000000 0.0
10864 21449 tt0061177 0.064317 0 0 What's Up, Tiger Lily? Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh... NaN Woody Allen WOODY ALLEN STRIKES BACK! ... In comic Woody Allen's film debut, he took the... 80 Action|Comedy Benedict Pictures Corp. 11/2/66 22 5.4 1966 0.000000 0.0
10865 22293 tt0060666 0.035919 19000 0 Manos: The Hands of Fate Harold P. Warren|Tom Neyman|John Reynolds|Dian... NaN Harold P. Warren It's Shocking! It's Beyond Your Imagination! ... A family gets lost on the road and stumbles up... 74 Horror Norm-Iris 11/15/66 15 1.5 1966 127642.279154 0.0

5 rows × 21 columns

# print('sample:')
movie_data.sample()
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
5477 112205 tt2404311 1.483329 30000000 36894225 The Family Robert De Niro|Michelle Pfeiffer|Dianna Agron|... NaN Luc Besson Some call it organized crime. Others call it f... ... The Manzoni family, a notorious mafia clan, is... 111 Crime|Comedy|Action Canal Plus|TF1 Films Production|Grive Producti... 9/13/13 710 6.1 2013 2.808100e+07 3.453423e+07

1 rows × 21 columns

# 3. Use the .dtypes attribute to check the data type of each column
movie_data.dtypes
id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object
# 4. Use isnull() together with .any() to check whether each column contains null values
movie_data.isnull().any()
id                      False
imdb_id                  True
popularity              False
budget                  False
revenue                 False
original_title          False
cast                     True
homepage                 True
director                 True
tagline                  True
keywords                 True
overview                 True
runtime                 False
genres                   True
production_companies     True
release_date            False
vote_count              False
vote_average            False
release_year            False
budget_adj              False
revenue_adj             False
dtype: bool
# 5. Use the .describe() method to see how the numeric data in the table is distributed
movie_data.describe()
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

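Task 1.3: Clean the data

In real-world work, data processing is often the most time-consuming and laborious step. Fortunately, the tmdb dataset provided here is very "clean" and does not require much cleaning or processing. In this step, your core job is to handle the null values in the table. You can use .fillna() to fill null values, or use .dropna() to drop rows or columns that contain null values.

Task: use an appropriate method to clean the null values and save the resulting data.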
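To make the difference between the two approaches concrete, here is a minimal sketch on a hypothetical three-row table (toy and its values are made up for illustration). It fills one column with a placeholder, drops rows that lack a director, and drops a column outright.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with some missing values.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "tagline": [np.nan, "hello", np.nan],
    "director": ["X", np.nan, "Y"],
})

# Option 1: fill nulls in a chosen column with a placeholder value.
filled = toy.fillna({"tagline": "unknown"})

# Option 2: drop only the rows where a chosen column is null.
dropped_rows = toy.dropna(subset=["director"])

# Option 3: drop a column that is not needed for the analysis.
dropped_cols = toy.drop(columns=["tagline"])

print(filled)
print(dropped_rows)
print(dropped_cols)
```

None of these calls modifies toy in place; each returns a new DataFrame, so the result has to be assigned if it should be kept.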

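#movie_data.info()
# movie_data.isnull().any() above shows which columns contain NaN; columns that are not important can be dropped
# First drop the columns that have many missing values and are not important: 'homepage', 'tagline', 'keywords', 'production_companies'
# Then drop the rows with missing values in the lightly affected columns: 'imdb_id', 'cast', 'director', 'overview', 'genres'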

movie_data = movie_data.drop(columns = ['homepage','tagline','keywords','production_companies'])
movie_data.dropna(axis = 0, inplace = True)
# movie_data.isnull().any()
movie_data.head(5)
id imdb_id popularity budget revenue original_title cast director overview runtime genres release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... Robert Schwentke Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... J.J. Abrams Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... James Wan Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09


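Section 2: Reading Data According to Specified Requirements

Compared with data analysis software such as Excel, one of Pandas' great strengths is that it can easily select the right data based on complex logic. Knowing how to retrieve the appropriate data from a table according to given requirements is therefore a very important Pandas skill, and it is the focus of this section.


Task 2.1: Simple reads

  1. Read the columns named id, popularity, budget, runtime, and vote_average.
  2. Read rows 1 to 20 of the table, plus rows 48 and 49.
  3. Read the popularity column for rows 50 to 60.

Requirement: each statement must be implemented in a single line of code.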
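One way to satisfy the single-line requirement for non-contiguous rows is positional indexing with .iloc combined with NumPy's np.r_ helper, which concatenates slices into one index array. The sketch below uses a hypothetical 100-row table (toy) rather than movie_data, and is only one of several possible approaches.

```python
import numpy as np
import pandas as pd

# Hypothetical 100-row table; popularity runs from 100 down to 1.
toy = pd.DataFrame({"popularity": np.arange(100, 0, -1)})

# Rows 1-20 plus rows 48 and 49 (1-based) are positions 0..19 and 47..48.
picked = toy.iloc[np.r_[0:20, 47:49]]

# Rows 50-60 (1-based, inclusive) of one column are positions 49..59.
pop_slice = toy.iloc[49:60]["popularity"]

print(len(picked), len(pop_slice))
```

Because .iloc is purely positional, it keeps working after rows have been dropped and the index labels are no longer consecutive.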

# Read the columns id, popularity, budget, runtime, and vote_average; show 5 rows as a check
movie_data[['id','popularity','budget','runtime','vote_average']].head(5)
id popularity budget runtime vote_average
0 135397 32.985763 150000000 124 6.5
1 76341 28.419936 150000000 120 7.1
2 262500 13.112507 110000000 119 6.3
3 140607 11.173104 200000000 136 7.5
4 168259 9.335014 190000000 137 7.3
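# Read rows 1 to 20 plus rows 48 and 49 by concatenating the two slices
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
pd.concat([movie_data[0:20], movie_data[47:49]])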
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09
5 281957 tt1663202 9.110700 135000000 532950503 The Revenant Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn... http://www.foxmovies.com/movies/the-revenant Alejandro González Iñárritu (n. One who has returned, as if from the dead.) ... In the 1820s, a frontiersman, Hugh Glass, sets... 156 Western|Drama|Adventure|Thriller Regency Enterprises|Appian Way|CatchPlay|Anony... 12/25/15 3929 7.2 2015 1.241999e+08 4.903142e+08
6 87101 tt1340138 8.654359 155000000 440603537 Terminator Genisys Arnold Schwarzenegger|Jason Clarke|Emilia Clar... http://www.terminatormovie.com/ Alan Taylor Reset the future ... The year is 2029. John Connor, leader of the r... 125 Science Fiction|Action|Thriller|Adventure Paramount Pictures|Skydance Productions 6/23/15 2598 5.8 2015 1.425999e+08 4.053551e+08
7 286217 tt3659388 7.667400 108000000 595380321 The Martian Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ... http://www.foxmovies.com/movies/the-martian Ridley Scott Bring Him Home ... During a manned mission to Mars, Astronaut Mar... 141 Drama|Adventure|Science Fiction Twentieth Century Fox Film Corporation|Scott F... 9/30/15 4572 7.6 2015 9.935996e+07 5.477497e+08
8 211672 tt2293640 7.404165 74000000 1156730962 Minions Sandra Bullock|Jon Hamm|Michael Keaton|Allison... http://www.minionsmovie.com/ Kyle Balda|Pierre Coffin Before Gru, they had a history of bad bosses ... Minions Stuart, Kevin and Bob are recruited by... 91 Family|Animation|Adventure|Comedy Universal Pictures|Illumination Entertainment 6/17/15 2893 6.5 2015 6.807997e+07 1.064192e+09
9 150540 tt2096673 6.326804 175000000 853708609 Inside Out Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha... http://movies.disney.com/inside-out Pete Docter Meet the little voices inside your head. ... Growing up can be a bumpy road, and it's no ex... 94 Comedy|Animation|Family Walt Disney Pictures|Pixar Animation Studios|W... 6/9/15 3935 8.0 2015 1.609999e+08 7.854116e+08
10 206647 tt2379713 6.200282 245000000 880674609 Spectre Daniel Craig|Christoph Waltz|Léa Seydoux|Ralp... http://www.sonypictures.com/movies/spectre/ Sam Mendes A Plan No One Escapes ... A cryptic message from Bond’s past sends him... 148 Action|Adventure|Crime Columbia Pictures|Danjaq|B24 10/26/15 3254 6.2 2015 2.253999e+08 8.102203e+08
11 76757 tt1617661 6.189369 176000003 183987723 Jupiter Ascending Mila Kunis|Channing Tatum|Sean Bean|Eddie Redm... http://www.jupiterascending.com Lana Wachowski|Lilly Wachowski Expand your universe. ... In a universe where human genetic material is ... 124 Science Fiction|Fantasy|Action|Adventure Village Roadshow Pictures|Dune Entertainment|A... 2/4/15 1937 5.2 2015 1.619199e+08 1.692686e+08
12 264660 tt0470752 6.118847 15000000 36869414 Ex Machina Domhnall Gleeson|Alicia Vikander|Oscar Isaac|S... http://exmachina-movie.com/ Alex Garland There is nothing more human than the will to s... ... Caleb, a 26 year old coder at the world's larg... 108 Drama|Science Fiction DNA Films|Universal Pictures International (UP... 1/21/15 2854 7.6 2015 1.379999e+07 3.391985e+07
13 257344 tt2120120 5.984995 88000000 243637091 Pixels Adam Sandler|Michelle Monaghan|Peter Dinklage|... http://www.pixels-movie.com/ Chris Columbus Game On. ... Video game experts are recruited by the milita... 105 Action|Comedy|Science Fiction Columbia Pictures|Happy Madison Productions 7/16/15 1575 5.8 2015 8.095996e+07 2.241460e+08
14 99861 tt2395427 5.944927 280000000 1405035767 Avengers: Age of Ultron Robert Downey Jr.|Chris Hemsworth|Mark Ruffalo... http://marvel.com/movies/movie/193/avengers_ag... Joss Whedon A New Age Has Come. ... When Tony Stark tries to jumpstart a dormant p... 141 Action|Adventure|Science Fiction Marvel Studios|Prime Focus|Revolution Sun Studios 4/22/15 4304 7.4 2015 2.575999e+08 1.292632e+09
15 273248 tt3460252 5.898400 44000000 155760117 The Hateful Eight Samuel L. Jackson|Kurt Russell|Jennifer Jason ... http://thehatefuleight.com/ Quentin Tarantino No one comes up here without a damn good reason. ... Bounty hunters seek shelter from a raging bliz... 167 Crime|Drama|Mystery|Western Double Feature Films|The Weinstein Company|Fil... 12/25/15 2389 7.4 2015 4.047998e+07 1.432992e+08
16 260346 tt2446042 5.749758 48000000 325771424 Taken 3 Liam Neeson|Forest Whitaker|Maggie Grace|Famke... http://www.taken3movie.com/ Olivier Megaton It Ends Here ... Ex-government operative Bryan Mills finds his ... 109 Crime|Action|Thriller Twentieth Century Fox Film Corporation|M6 Film... 1/1/15 1578 6.1 2015 4.415998e+07 2.997096e+08
17 102899 tt0478970 5.573184 130000000 518602163 Ant-Man Paul Rudd|Michael Douglas|Evangeline Lilly|Cor... http://marvel.com/movies/movie/180/ant-man Peyton Reed Heroes Don't Get Any Bigger ... Armed with the astonishing ability to shrink i... 115 Science Fiction|Action|Adventure Marvel Studios 7/14/15 3779 7.0 2015 1.195999e+08 4.771138e+08
18 150689 tt1661199 5.556818 95000000 542351353 Cinderella Lily James|Cate Blanchett|Richard Madden|Helen... 0 Kenneth Branagh Midnight is just the beginning. ... When her father unexpectedly passes away, youn... 112 Romance|Fantasy|Family|Drama Walt Disney Pictures|Genre Films|Beagle Pug Fi... 3/12/15 1495 6.8 2015 8.739996e+07 4.989630e+08
19 131634 tt1951266 5.476958 160000000 650523427 The Hunger Games: Mockingjay - Part 2 Jennifer Lawrence|Josh Hutcherson|Liam Hemswor... http://www.thehungergames.movie/ Francis Lawrence The fire will burn forever. ... With the nation of Panem in a full scale war, ... 136 War|Adventure|Science Fiction Studio Babelsberg|StudioCanal|Lionsgate|Walt D... 11/18/15 2380 6.5 2015 1.471999e+08 5.984813e+08
47 286565 tt3622592 2.968254 12000000 85512300 Paper Towns Nat Wolff|Cara Delevingne|Halston Sage|Justice... 0 Jake Schreier Get Lost. Get Found. ... Quentin Jacobsen has spent a lifetime loving t... 109 Drama|Mystery|Romance Fox 2000 Pictures 7/9/15 1252 6.2 2015 1.104000e+07 7.867128e+07
48 265208 tt2231253 2.932340 30000000 0 Wild Card Jason Statham|Michael Angarano|Milo Ventimigli... 0 Simon West Never bet against a man with a killer hand. ... When a Las Vegas bodyguard with lethal skills ... 92 Thriller|Crime|Drama Current Entertainment|Lionsgate|Sierra / Affin... 1/14/15 481 5.3 2015 2.759999e+07 0.000000e+00

22 rows × 21 columns

# Read the popularity column for rows 50 to 60. Slices exclude the end position, so 1-based rows m to n correspond to the slice [m - 1:n]
# Rows 50 to 60 are therefore the slice 49:60
movie_data[49:60][['popularity']]
popularity
49 2.885126
50 2.883233
51 2.814802
52 2.798017
53 2.793297
54 2.614499
55 2.584264
56 2.578919
57 2.575711
58 2.557859
59 2.550747

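Task 2.2: Logical Indexing

  1. Read all rows where popularity is greater than 5.
  2. Read all rows where popularity is greater than 5 and the release year is after 1996.

Hint: the logical operators in Pandas, such as & and |, mean "and" and "or" respectively.

Requirement: use Logical Indexing.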
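A small sketch of logical indexing on a hypothetical four-row table (toy, with made-up values) may help here. The point it illustrates is that each comparison must be wrapped in parentheses, because & and | bind more tightly than > in Python.

```python
import pandas as pd

# Hypothetical toy table (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "popularity": [6.2, 4.9, 7.5, 5.1],
    "release_year": [1990, 2001, 2010, 1996],
})

# A single condition produces a boolean Series used as a row mask.
popular = toy[toy["popularity"] > 5]

# "and": both conditions must hold; note the parentheses around each one.
popular_after_1996 = toy[(toy["popularity"] > 5) & (toy["release_year"] > 1996)]

# "or": at least one condition must hold.
popular_or_recent = toy[(toy["popularity"] > 5) | (toy["release_year"] > 2000)]

print(popular_after_1996)
```

Whether "after 1996" should include 1996 itself is a reading of the task; this sketch uses the strict > comparison.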

# Read all rows where popularity is greater than 5; print the first 5
movie_data[movie_data.popularity > 5].head(5)
id imdb_id popularity budget revenue original_title cast director overview runtime genres release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... Robert Schwentke Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... J.J. Abrams Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... James Wan Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09
# Read all rows where popularity is greater than 5 and the release year is after 1996 (this code includes 1996 itself; use > to exclude it)
movie_data[(movie_data.popularity > 5) & (movie_data.release_year >= 1996)].head(5)
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns


Task 2.3: Grouped reads

  1. Group by release_year and use .agg to get the mean of revenue.
  2. Group by director and use .agg to get the mean of popularity, sorted from high to low.

Requirement: use the Groupby command.
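Since the task asks for .agg specifically, here is a minimal sketch of both groupings on a hypothetical four-row table (toy, with made-up values). Passing the string 'mean' to .agg gives the same result as calling .mean() directly.

```python
import pandas as pd

# Hypothetical toy table (values are made up).
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001],
    "director": ["X", "Y", "X", "Y"],
    "revenue": [10.0, 30.0, 50.0, 70.0],
    "popularity": [1.0, 4.0, 3.0, 6.0],
})

# Mean revenue per release year.
mean_revenue_by_year = toy.groupby("release_year")["revenue"].agg("mean")

# Mean popularity per director, sorted from high to low.
mean_pop_by_director = (
    toy.groupby("director")["popularity"].agg("mean").sort_values(ascending=False)
)

print(mean_revenue_by_year)
print(mean_pop_by_director)
```

.agg also accepts a list such as ['mean', 'std'] when several statistics are wanted at once.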

# Compute the mean revenue for each release year
movie_data.groupby(['release_year'])['revenue'].mean().head(5)
release_year
1960    4.531406e+06
1961    1.089420e+07
1962    6.736870e+06
1963    5.511911e+06
1964    8.118614e+06
Name: revenue, dtype: float64
# Group by director, take the mean of popularity (.mean() is equivalent to .agg('mean')), and sort from high to low with sort_values; print only the top 10 directors
movie_data.groupby(['director'])['popularity'].mean().sort_values(ascending=False).head(10)
director
Colin Trevorrow                16.696886
Joe Russo|Anthony Russo        12.971027
Chad Stahelski|David Leitch    11.422751
Don Hall|Chris Williams         8.691294
Juno John Lee                   8.411577
Kyle Balda|Pierre Coffin        7.404165
Alan Taylor                     6.883129
Peter Richardson                6.668990
Pete Docter                     6.326804
Christopher Nolan               6.195521
Name: popularity, dtype: float64


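Section 3: Plotting and Visualization

Next, you will plot and visualize your data. The most important thing in this section is being able to choose a suitable chart for a specific visualization goal. A visualization goal is the information or change you hope to observe through the visualization, for example how box office revenue changes over time, or which director is the most popular.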

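Visualization goal                                      Charts you can use
Show the distribution of one attribute                  pie chart, histogram, scatter plot
Show how one attribute changes with another variable    bar chart, line chart, heat map
Compare the relationships among several attributes      scatter plot, violin plot, stacked bar chart, stacked line chart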

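In this part, choose a suitable chart for each question, plot it, and analyze the result. The optional tasks are somewhat harder; feel free to take on the challenge.

Task 3.1: Plot the popularity values of the 20 movies with the highest popularity.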

# movie_data[['original_title','popularity']].sort_values(by='popularity', ascending=False)[:20]
# top_movies = movie_data.set_index('original_title')['popularity'].sort_values()[-20:]
top_movies = movie_data[['original_title','popularity']].sort_values(by='popularity', ascending=False)[:20].sort_values(by='popularity', ascending=True)

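# Set the bar colors
# @see doc:
# https://stackoverflow.com/questions/18973404/setting-different-bar-color-in-matplotlib-python
my_colors = list('rgbkymc')  # red, green, blue, black, yellow, magenta, cyan

# Use original_title as the index and draw a horizontal bar chart (barh).
# In recent pandas versions, plotting the popularity Series rather than a one-column DataFrame lets the color list cycle across the bars.
top_movies.set_index('original_title')['popularity'].plot(kind='barh', color=my_colors)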
plt.xlabel('Popularity')
plt.ylabel('Original Title')
plt.title('Top 20 Movies by Popularity');

[Figure: horizontal bar chart "Top 20 Movies by Popularity"; x-axis Popularity, y-axis Original Title]


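Task 3.2: Analyze how movie net profit (revenue minus cost) changes by year, and briefly discuss the result.

# Add a new column: profit = revenue - budget (using the adjusted values)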
movie_data['profit'] = movie_data['revenue_adj'] - movie_data['budget_adj']
# movie_data.head(5)

# Group by year and compute the mean profit per year
movie_data.groupby(['release_year'])['profit'].mean().plot(kind='line', figsize=(16, 8))

plt.ylabel('profit_mean')
Text(0,0.5,'profit_mean')

[Figure: line chart of mean profit (profit_mean) by release_year]

# Group by year and compute the standard deviation of profit
movie_data.groupby(['release_year'])['profit'].std().plot(kind='line', figsize=(16, 8))
plt.ylabel('profit_std')
Text(0,0.5,'profit_std')

[Figure: line chart of the standard deviation of profit (profit_std) by release_year]

# Count the number of movies released per year
movie_data.groupby('release_year')['original_title'].count().plot(kind='line', figsize=(16, 8));
plt.ylabel('movie_count')
Text(0,0.5,'movie_count')

[Figure: line chart of the number of movies released per year by release_year]

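# Analysis of movie net profit
# 1. Mean profit fluctuated widely from the 1960s through the 1980s and became more stable afterwards
# 2. As the number of movies released each year grew, the mean net profit per movie declined over time

[Optional] Task 3.3: Select the 10 most prolific directors (those with the most movies), plot the box office revenue of each one's top 3 movies, and briefly analyze the result.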

# 1. First select the 10 most prolific directors
# new_movie_data = movie_data[['original_title', 'revenue', 'director']]
# new_movie_data.groupby(new_movie_data['director'])['original_title'].count().sort_values(ascending=False).head(10)

tmp = movie_data['director'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('director') 
movie_data_split = movie_data[['original_title', 'revenue']].join(tmp)
movie_data_split.groupby(movie_data_split['director'])['original_title'].count().sort_values(ascending=False).head(10)
director
Woody Allen          46
Clint Eastwood       34
Martin Scorsese      31
Steven Spielberg     30
Ridley Scott         23
Steven Soderbergh    23
Ron Howard           22
Joel Schumacher      21
Tim Burton           20
Brian De Palma       20
Name: original_title, dtype: int64
# 2. Get each director's top 3 movies. Note: the ranking here is by vote_average, with revenue_adj shown alongside it,
#    and only movies credited to that director alone match the == filter
directors = list(movie_data_split.groupby('director')['original_title'].count().sort_values(ascending=False).head(10).index)
columns = ['director', 'original_title', 'revenue_adj', 'vote_average']
top3_per_director = []
for director in directors:
    # For each director, keep the director, title, adjusted revenue, and rating of the 3 highest-rated movies
    solo = movie_data[movie_data['director'] == director]
    top3_per_director.append(solo.sort_values(by='vote_average', ascending=False)[columns].head(3))
df = pd.concat(top3_per_director)
# Use the director as the index
df.set_index('director', inplace=True)
df
original_title revenue_adj vote_average
director
Woody Allen Manhattan 1.200223e+08 7.7
Woody Allen Annie Hall 1.376203e+08 7.6
Woody Allen Hannah and Her Sisters 7.974345e+07 7.3
Clint Eastwood Million Dollar Baby 2.502418e+08 7.6
Clint Eastwood Gran Torino 2.734101e+08 7.6
Clint Eastwood Unforgiven 2.473345e+08 7.5
Martin Scorsese The Last Waltz 1.076189e+06 8.0
Martin Scorsese Goodfellas 7.816519e+07 8.0
Martin Scorsese George Harrison: Living in the Material World 0.000000e+00 8.0
Steven Spielberg Schindler's List 4.849410e+08 8.1
Steven Spielberg Saving Private Ryan 6.445564e+08 7.7
Steven Spielberg Catch Me If You Can 4.268546e+08 7.6
Ridley Scott Blade Runner 7.404548e+07 7.7
Ridley Scott Gladiator 5.795065e+08 7.7
Ridley Scott The Martian 5.477497e+08 7.6
Steven Soderbergh Ocean's Eleven 5.550528e+08 7.0
Steven Soderbergh Erin Brockovich 3.245143e+08 6.9
Steven Soderbergh The Limey 4.179939e+06 6.6
Ron Howard Rush 8.447479e+07 7.7
Ron Howard A Beautiful Mind 3.861237e+08 7.5
Ron Howard Apollo 13 5.083337e+08 7.1
Joel Schumacher Falling Down 6.174274e+07 7.0
Joel Schumacher A Time to Kill 2.116828e+08 7.0
Joel Schumacher The Phantom of the Opera 1.785337e+08 6.8
Tim Burton Vincent 0.000000e+00 7.9
Tim Burton Edward Scissorhands 8.845162e+07 7.4
Tim Burton Big Fish 1.457024e+08 7.4
Brian De Palma Scarface 1.442422e+08 7.8
Brian De Palma Phantom of the Paradise 0.000000e+00 7.5
Brian De Palma The Untouchables 1.463691e+08 7.5

[Optional] Task 3.4: Analyze how the number of movies released in June changed from 1968 to 2015.

# Select the years 1968 to 2015 (between() includes both endpoints by default)
new_data = movie_data
sel_year = new_data['release_year'].between(1968, 2015)
# Select movies released in June; only the month is used, so the two-digit year in release_date does not matter
sel_june = pd.to_datetime(new_data['release_date']).dt.month == 6

new_data[sel_year & sel_june]['release_year'].value_counts().sort_index().plot(kind='line', figsize=(20, 10), lw = 3);

plt.xlabel('release_year', fontsize = 16);
plt.ylabel('movies_in_June_from_1968_to_2015', fontsize = 16);
plt.grid(True)

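[Figure: line chart of the number of June releases per year (movies_in_June_from_1968_to_2015) by release_year]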

# Number of movies released in June, 1968 to 2015: the overall trend is upward with short-term dips, and the rise accelerates after 2000

[Optional] Task 3.5: Analyze how the numbers of Comedy and Drama movies released in June changed from 1968 to 2015.
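The original post gives no code for this task. The following is a minimal sketch of one possible approach, shown on a hypothetical five-row table (toy, with made-up values) that mimics the release_date, release_year, and genres columns. It assumes dates in month/day/two-digit-year form and genres joined by "|", as in the sample rows above, and it matches genres by plain substring.

```python
import pandas as pd

# Hypothetical toy table mimicking three columns of movie_data.
toy = pd.DataFrame({
    "release_date": ["6/9/15", "6/1/14", "7/4/15", "6/20/15", "6/15/14"],
    "release_year": [2015, 2014, 2015, 2015, 2014],
    "genres": ["Comedy|Drama", "Drama", "Comedy", "Action|Comedy", None],
})

# Keep June releases between 1968 and 2015 (only the month is read from the date).
month = pd.to_datetime(toy["release_date"], format="%m/%d/%y").dt.month
june = toy[(month == 6) & toy["release_year"].between(1968, 2015)]

# Count June movies per year for each genre; a movie tagged with both genres counts in both.
counts = pd.DataFrame({
    genre: june[june["genres"].str.contains(genre, na=False, regex=False)]
    .groupby("release_year")
    .size()
    for genre in ["Comedy", "Drama"]
}).fillna(0).astype(int)

print(counts)
```

On the real data, counts.plot(kind='line') would then draw one line per genre, which makes the two trends directly comparable.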

Those who keep doing will succeed; those who keep walking will arrive.