To study graph embedding, I chose DBLP as my dataset and went to fetch it.
https://www.aminer.org/citation
I went to this page and downloaded the dataset.
The problem is that the data has to be preprocessed, and the file is 16.1 GB.
Most editors cannot even open a file that large, so at first I had no idea how to handle it.
My plan was to split the file into pieces of a fixed size and then patch the pieces up by hand.
The program I used is GSplit 3.
I loaded the downloaded DBLP JSON file into it and first split it into 1 GB pieces.
Because the split is by size rather than by dictionary (record), the JSON structure breaks at every cut.
To fix this, you have to go back and forth between each pair of adjacent files and repair the JSON lines that were cut in two.
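To see why cutting by size breaks the format, here is a minimal sketch on a tiny stand-in for the dump. The sample records, the 40-byte chunk size, and the helper names `split_by_size` and `is_valid_json` are all assumptions made for illustration; GSplit itself is not involved.

```python
import json

# Hypothetical miniature stand-in for the DBLP dump: a JSON array of records.
records = [{"_id": str(n), "title": "paper %d" % n} for n in range(5)]
raw = json.dumps(records).encode("utf-8")


def split_by_size(blob, chunk_size):
    """Cut a byte string into fixed-size pieces, ignoring record boundaries."""
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]


def is_valid_json(blob):
    """Return True if the bytes parse as one complete JSON document."""
    try:
        json.loads(blob.decode("utf-8"))
        return True
    except ValueError:
        return False


chunks = split_by_size(raw, 40)
for number, chunk in enumerate(chunks, start=1):
    # Each piece starts or ends in the middle of a record, so none parses alone.
    print(number, is_valid_json(chunk), chunk[:30])
```

The whole file parses, but no individual piece does, which is why every boundary between two neighbouring files needs a manual repair.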
I installed Liquid Studio to edit the data.
VS Code, PyCharm, Notepad, and the other editors I tried could not edit files this large, so I made the fixes in this program.
Moving from file to file, I spent about 20 minutes repairing the 17 split files so that each one is valid JSON.
The 17 split JSON files are now usable.
DBLP v13 is documented as following a fixed schema, but a few records actually differ from it, so I found and fixed those first.
When you try to load the data, parsing fails because strings such as NumberInt(0) are mixed into it.
So the file has to be processed as follows before it can be loaded.
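import json
import re
# `path` (file-name prefix) and `i` (zero-based file index) come from the surrounding loop.
with open(path + str(i + 1) + ".json", 'r', encoding='utf-8') as f:
    data = f.read()
# Turn NumberInt(0) into 0 so that the text becomes valid JSON.
fixed_data = re.sub(r"NumberInt\((\d+)\)", r"\1", data)
load_data = json.loads(fixed_data)
print("parse_json result: %s" % type(load_data))
The re.sub line is the code that replaces the NumberInt(...) wrappers.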
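The substitution can be tried on a small string before running it on a 1 GB file. The sketch below is only an illustration: the sample record and the helper name `strip_number_int` are hypothetical, and the optional minus sign in the pattern is an added assumption that the original pattern does not make.

```python
import json
import re

# Assumption: a wrapper might also hold a negative integer, hence the optional "-".
NUMBER_INT = re.compile(r"NumberInt\((-?\d+)\)")


def strip_number_int(text):
    """Replace every NumberInt(n) in the text with the bare integer n."""
    return NUMBER_INT.sub(r"\1", text)


# Hypothetical one-record sample shaped like the dump.
sample = '[{"title": "iCity", "year": NumberInt(2007), "n_citation": NumberInt(2)}]'
parsed = json.loads(strip_number_int(sample))
print(parsed[0]["year"], type(parsed[0]["year"]))
```

One caveat of a plain text substitution is that it would also rewrite a literal `NumberInt(...)` appearing inside a string value such as an abstract; for this dataset that seems unlikely, but it is worth keeping in mind.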
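With the data loaded, I preprocessed it to extract only the meaningful keys.
I judged fields such as doi and volume to be unnecessary, so I first sorted the keys into those to keep and those to drop, and then did the preprocessing.
The keys I ended up extracting, along with their types, are listed below.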
"""
_id str
title str
year int
keywords list of str
fos list of str
references list of str
n_citation int
abstract str
authors._id list of str
authors.name list of str
authors.org list of str
authors.orgid list of str
venue.sid str
venue.raw str
14 keys
122783 dictionaries
374 MB
"""
Rather than deleting keys from the existing data,
I built a new list and new dictionaries, adding the key-value pairs in the order I wanted.
For each record, I first check that every required key is present.
If even one is absent or holds missing data, the whole record is skipped;
if all the required information exists, the record is appended to the list, which is then written out as JSON.
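The filtering described above could be sketched roughly as follows. This is not the original script: the function names are hypothetical, the nested shape of the raw record (`authors` as a list of dictionaries, `venue` as a dictionary) is inferred from the flattened key names, and treating `None`, empty strings, and empty lists as missing data is an assumption.

```python
TOP_LEVEL_KEYS = ["_id", "title", "year", "keywords", "fos",
                  "references", "n_citation", "abstract"]
AUTHOR_KEYS = ["_id", "name", "org", "orgid"]
VENUE_KEYS = ["sid", "raw"]


def is_missing(value):
    """Assumption: None, empty strings, and empty lists count as missing data."""
    return value is None or value == "" or value == []


def extract_record(paper):
    """Build a new flat dict in the desired key order, or return None to skip."""
    flat = {}
    for key in TOP_LEVEL_KEYS:
        if is_missing(paper.get(key)):
            return None
        flat[key] = paper[key]
    authors = paper.get("authors")
    if is_missing(authors):
        return None
    for key in AUTHOR_KEYS:
        # Collect one value per author, e.g. all names into "authors.name".
        values = [author.get(key) for author in authors]
        if any(is_missing(value) for value in values):
            return None
        flat["authors." + key] = values
    venue = paper.get("venue")
    if is_missing(venue):
        return None
    for key in VENUE_KEYS:
        if is_missing(venue.get(key)):
            return None
        flat["venue." + key] = venue[key]
    return flat


def filter_papers(papers):
    """Keep only the records for which every required field is present."""
    kept = []
    for paper in papers:
        flat = extract_record(paper)
        if flat is not None:
            kept.append(flat)
    return kept
```

Building a fresh dictionary per record, instead of deleting unwanted keys, fixes the key order in the output and leaves the loaded data untouched. The resulting list can then be written out with `json.dump`.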
The final JSON file has the following format.
[
{
"_id": "53e9978db7602d9701f50739",
"title": "iCity",
"year": 2007,
"keywords": [
"Irregular cellular automata (CA)",
"Complex systems modelling",
"Geographic information systems (GIS)",
"Urban planning",
"Modelling urban growth",
"ArcGIS",
"Visual Basic .NET 2003",
"ArcObjects"
],
"fos": [
"Software tool",
"Asynchronous communication",
"Cellular automaton",
"Geographic information system",
"Systems engineering",
"ArcObjects",
"Computer science",
"Familiar environment",
"Urban planning",
"Predictive modelling"
],
"references": [
"53e9ac54b7602d9703624f16",
"53e99aedb7602d9702379fcc",
"53e9ae4eb7602d9703864863",
"53e99dccb7602d970268a788",
"53e9b7b4b7602d970435d6a5",
"53e9b923b7602d97045101f6",
"53e9b3b2b7602d9703e97364",
"599c7a31601a182cd2699edc",
"53e9be14b7602d9704acbec1",
"53e99caeb7602d970255df3a",
"5c7916064895d9cbc61c9f6f",
"53e99fa8b7602d970287d2bf",
"53e9b195b7602d9703c1d036",
"53e9ba95b7602d97046bbc22",
"53e9b9d3b7602d97045ca1a4",
"53e9b607b7602d970415d4bd",
"53e9bb6db7602d97047b1b94",
"53e9a317b7602d9702c227ea",
"53e9ade2b7602d97037ed1d6"
],
"n_citation": 2,
"abstract": "The objective of this study is to present a novel tool for predictive modelling of urban growth. The proposed tool, named iCity \u2013 Irregular City, extends the traditional formalization of cellular automata (CA) to include an irregular spatial structure, asynchronous urban growth, and a high spatio-temporal resolution to aid in spatial decision making for urban planning. The iCity software tool was developed as an embedded model within a common desktop geographic information system (GIS) with a user-friendly interface to control modelling operations for urban land-use change. This approach allows the model developer to focus on implementing model logic rather than developing an entire stand-alone modelling application. It also provides the model user with a familiar environment in which to run the model to simulate urban growth.",
"authors._id": [
"53f43531dabfaeb2ac048cb0",
"54329aafdabfaeb4c6a92375",
"53f39faadabfae4b34ab1ca2"
],
"authors.name": [
"D. Stevens",
"S. Dragicevic",
"Kristina Rothley"
],
"authors.org": [
"Spatial Analysis and Modeling Laboratory, Department of Geography, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A1S6",
"Spatial Analysis and Modeling Laboratory, Department of Geography, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A1S6",
"School of Resource and Environmental Management, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A1S6"
],
"authors.orgid": [
"5f71b2841c455f439fe3c68e",
"5f71b2841c455f439fe3c68e",
"5f71b2841c455f439fe3c68e"
],
"venue.sid": "environmental-modelling-and-software",
"venue.raw": "Environmental Modelling & Software"
}
]
This data can now be turned into a network (graph) and used.
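As one possible starting point, a citation graph can be derived from the `_id` and `references` fields. The sketch below is an assumption about what the graph might look like, not the method used here: the function name `build_citation_edges` is hypothetical, and dropping edges that point to papers outside the filtered set is a design choice made for the example.

```python
def build_citation_edges(papers):
    """Return (nodes, edges); each edge is a (citing_id, cited_id) pair."""
    nodes = {paper["_id"] for paper in papers}
    edges = []
    for paper in papers:
        for cited in paper["references"]:
            # Keep only edges whose target is also in the filtered data.
            if cited in nodes:
                edges.append((paper["_id"], cited))
    return nodes, edges


# Hypothetical three-paper sample in the final record format.
sample_papers = [
    {"_id": "p1", "references": ["p2", "p3", "outside"]},
    {"_id": "p2", "references": ["p3"]},
    {"_id": "p3", "references": ["outside"]},
]
nodes, edges = build_citation_edges(sample_papers)
print(len(nodes), edges)
```

The same records could instead yield a co-authorship graph from `authors._id`, or a paper-venue graph from `venue.sid`, depending on what the embedding is meant to capture.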