Scraping Facebook Public Page Posts with Python
Resource Recommendation
A while back I needed to scrape Facebook for a project, but because of the pandemic the official individual Graph API had suspended permission applications, so I turned to GitHub for open-source alternatives. After trying quite a few packages, I have listed the ones I found most useful below. They are intended for personal study and exchange only, not for commercial use.
1. An online scraper for basic Facebook page information (public address, phone number, email, business hours, etc.). Fast and convenient, with a free trial version:
phantombuster.com/automations/facebook/8369/facebook-profile-scraper
2. From GitHub. In my tests it is quite powerful for scraping posts, videos, etc. from personal profiles, but it requires valid credentials (registration email and password):
github.com/harismuneer/Ultimate-Facebook-Scraper
3. From GitHub. It can scrape all posts of a public page together with their timestamps, share/like/comment counts, post IDs, etc., and it needs no credentials. It is one of the few working tools I found that can scrape public pages; unfortunately it cannot retrieve the text of individual comments:
github.com/kevinzg/facebook-scraper
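The script in the next section relies on a small pandas trick: converting a Timestamp to a string and slicing the first ten characters leaves only the 'YYYY-MM-DD' date part. A quick self-contained sketch of that conversion (the column name here is made up for illustration):

```python
import pandas as pd

# Hypothetical column of datetime values, standing in for the Excel date columns.
df = pd.DataFrame({'established': [pd.Timestamp('2015-06-01 00:00:00'),
                                   pd.Timestamp('2019-11-23 08:30:00')]})

# str(Timestamp) yields 'YYYY-MM-DD HH:MM:SS';
# slicing the first 10 characters keeps only the date.
df['established'] = df['established'].astype(str).apply(lambda x: x[0:10])
print(df['established'].tolist())  # ['2015-06-01', '2019-11-23']
```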
Practical Usage
In the end I chose the third option to scrape all posts from the target companies' Facebook public pages and write the results to an xlsx file:
import re
import pandas as pd
from Facebook_Scraper.facebook_scraper import get_posts
from Facebook_Scraper.facebook_scraper import fetch_share_and_reactions

def facebook_scrap():
    # The incorporation and dissolution dates are stored as timestamps;
    # read them as strings so we can keep only the date part.
    data = pd.read_excel('../data/dataset.xlsx',
                         converters={'Date of Establishment_legal': str, 'Dissolved_legal': str})
    # Column 'Date of Establishment_legal' contains the company's incorporation date,
    # column 'Dissolved_legal' contains the company's dissolution date, and column
    # 'Facebook' contains the link of the company's Facebook public page, if any.
    # We only keep companies that have a Facebook link.
    data = data[data['Facebook'].notna()]
    data['Date of Establishment_legal'] = data['Date of Establishment_legal'].apply(lambda x: x[0:10])
    data['Dissolved_legal'] = data['Dissolved_legal'].apply(lambda x: x[0:10] if type(x) == str else x)
    # The scraper takes an account name as input, so extract the account name from each link.
    links = data['Facebook'].to_list()
    account = [0 for _ in range(data.shape[0])]
    pattern = re.compile(r'www\.facebook\.com/([a-zA-Z0-9.]+)')
    for i in range(len(links)):
        try:
            name = re.findall(pattern, links[i])[0]
            account[i] = name
        except IndexError:
            account[i] = 0
    posts_data = pd.DataFrame({"post_id": "", "text": "", "post_text": "", "shared_text": "",
                               "time": "", "image": "", "likes": "", "comments": "",
                               "shares": "", "post_url": "", "link": ""}, index=["0"])
    abbreviation = data['Company name_abbreviation'].to_list()
    incorporation_date = data['Date of Establishment_legal'].to_list()
    dissolution_date = data['Dissolved_legal'].to_list()
    # Start scraping posts.
    for i in range(len(account)):
        if account[i] == 0:  # skip links from which no account name could be extracted
            continue
        cnt = 0
        # There are about 2 posts per page, so pages=4000 should be enough to
        # scrape every post since the account was created.
        for post in get_posts(account=account[i], pages=4000):
            cnt += 1
            more_info_post = fetch_share_and_reactions(post)
            more_info_post['Company name_abbreviation'] = abbreviation[i]
            more_info_post['account'] = account[i]
            more_info_post['incorporation_date'] = incorporation_date[i]
            more_info_post['dissolution_date'] = dissolution_date[i]
            df = pd.DataFrame(more_info_post, index=["0"])
            # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
            posts_data = pd.concat([posts_data, df], ignore_index=True, sort=False)
        print(account[i], cnt, 'posts are scraped.')
    useful_columns = ['post_id', 'text', 'shared_text', 'time', 'image', 'likes',
                      'comments', 'shares', 'post_url', 'link',
                      'Company name_abbreviation', 'account', 'incorporation_date', 'dissolution_date']
    posts_data = pd.DataFrame(posts_data, columns=useful_columns)
    posts_data = posts_data.drop([0])  # drop the empty placeholder row
    posts_data.to_excel('../data/all_facebook_posts.xlsx', index=False)
    return posts_data
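The account-name extraction step can be sanity-checked in isolation with the same regular expression; the sample links below are made up:

```python
import re

# Same pattern as in facebook_scrap(): capture the account name after the domain.
pattern = re.compile(r'www\.facebook\.com/([a-zA-Z0-9.]+)')

links = ['https://www.facebook.com/nintendo',           # made-up sample links
         'https://www.facebook.com/some.company/about',
         'https://example.com/not-facebook']

accounts = []
for link in links:
    match = re.findall(pattern, link)
    accounts.append(match[0] if match else 0)  # 0 marks an unusable link, as in the script

print(accounts)  # ['nintendo', 'some.company', 0]
```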