3 回答

TA貢獻(xiàn)1789條經(jīng)驗 獲得超8個贊
這種方法使用 pyparsing 來實際解析 SQL 語句(在示例中顯示的子集語法中):
import pyparsing as pp
ppc = pp.pyparsing_common
ident = ppc.identifier
CREATE, INSERT, INTO, TABLE, AS, SELECT, FROM, JOIN = \
map(pp.CaselessKeyword, "CREATE INSERT INTO TABLE AS SELECT FROM JOIN".split())
select_stmt = (SELECT
+ (pp.delimitedList(ident) | '*')("columns")
+ FROM
+ ((ident + JOIN.suppress() + ident) | ident)("tables"))
src_target_stmt = ((CREATE + TABLE | INSERT + INTO)("action")
+ ident("target")
+ AS
+ pp.Group(select_stmt)("source"))
tests = """
CREATE TABLE TAR_TAB1 AS SELECT * FROM SRC_TAB1 JOIN SRC_TAB2
CREATE TABLE TAR_TAB2 AS SELECT * FROM SRC_TAB3 JOIN SRC_TAB4
INSERT INTO TAR_TAB3 AS SELECT COL1,COL2 FROM SRC_TAB5 JOIN SRC_TAB6
"""
# useful for debugging
#src_target_stmt.runTests(tests)
# dump parsed values out as CSV output
for t in tests.splitlines():
if not t.strip():
continue
result = src_target_stmt.parseString(t)
target = result.target
action = result.action[0]
for src in result.source.tables:
print("{},{},{}".format(action, target, src))
印刷:
CREATE,TAR_TAB1,SRC_TAB1
CREATE,TAR_TAB1,SRC_TAB2
CREATE,TAR_TAB2,SRC_TAB3
CREATE,TAR_TAB2,SRC_TAB4
INSERT,TAR_TAB3,SRC_TAB5
INSERT,TAR_TAB3,SRC_TAB6
隨著您繼續(xù)您的項目并發(fā)現(xiàn)新的需求(需要提取 SQL 操作,如本例所示,或者您發(fā)現(xiàn)您需要在解析中理解的 SQL 結(jié)構(gòu)的其他變體),擴(kuò)展解析器將更容易和更易于維護(hù)比擴(kuò)展正則表達(dá)式。

TA貢獻(xiàn)1863條經(jīng)驗 獲得超2個贊
這是一個解決方案:
targets = re.findall(r'(?:CREATE\s+TABLE|INSERT\s+INTO)\s+([a-z0-9A-Z_]+)\s+AS', text)
sources = re.findall(r'SELECT\s+\*\s+FROM\s([a-z0-9A-Z_]+)\s+JOIN\s+([a-z0-9A-Z_]+)', text)
# each target has multiple sources, so repeat each target n times per
# number of sources.
lens = [len(src) for src in sources]
targets = np.repeat(targets, lens)
# 'flatten' the list of sources from [(t1, t2), (t3, t4)] to
# [t1, t2, t3, t4]
sources = [tab for exp in sources for tab in exp]
pd.DataFrame({"src": sources, "tgt": targets})
結(jié)果:
src tgt
0 SRC_TAB1 TAR_TAB1
1 SRC_TAB2 TAR_TAB1
2 SRC_TAB3 TAR_TAB2
3 SRC_TAB4 TAR_TAB2
4 SRC_TAB5 TAR_TAB3
5 SRC_TAB6 TAR_TAB3

TA貢獻(xiàn)1844條經(jīng)驗 獲得超8個贊
| : A|B,其中 A 和 B 可以是任意 RE,創(chuàng)建一個匹配 A 或 B 的正則表達(dá)式。
import re
import pandas as pd
Q1 = 'CREATE TABLE TAR_TAB1 AS SELECT * FROM SRC_TAB1 JOIN SRC_TAB2'
Q2 = 'CREATE TABLE TAR_TAB2 AS SELECT * FROM SRC_TAB3 JOIN SRC_TAB4'
Q3 = 'CREATE TABLE TAR_TAB3 AS SELECT * FROM SRC_TAB5 JOIN SRC_TAB6'
requests = [Q1, Q2, Q3]
target_filter = r'\w+(?=\s+AS)'
source_filter1 = r'(?<=FROM )\S+'
source_filter2 = r'(?<=JOIN )\S+'
regex_filter = target_filter + '|' + source_filter1 + '|' + source_filter2
results = [re.findall(regex_filter, Q) for Q in requests]
print(results)
# [['TAR_TAB1', 'SRC_TAB1', 'SRC_TAB2'], ['TAR_TAB2', 'SRC_TAB3', 'SRC_TAB4'], ['TAR_TAB3', 'SRC_TAB5', 'SRC_TAB6']]
要轉(zhuǎn)換數(shù)據(jù)框中的結(jié)果列表:
df = pd.DataFrame(results, columns =['target', 'source1', 'source2'])
target source1 source2
0 TAR_TAB1 SRC_TAB1 SRC_TAB2
1 TAR_TAB2 SRC_TAB3 SRC_TAB4
2 TAR_TAB3 SRC_TAB5 SRC_TAB6
或者,如果您想以不同的方式呈現(xiàn)它:
l = []
for res in results:
l.append([res[0], res[1]])
l.append([res[0], res[2]])
df = pd.DataFrame(l, columns =['target', 'source'])
接著 :
target source
0 TAR_TAB1 SRC_TAB1
1 TAR_TAB1 SRC_TAB2
2 TAR_TAB2 SRC_TAB3
3 TAR_TAB2 SRC_TAB4
4 TAR_TAB3 SRC_TAB5
5 TAR_TAB3 SRC_TAB6
添加回答
舉報