How to remove special characters between double quotes (gawk, C)

Feb 2, 2021

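This time the task was to move access logs stored in Postgres over to HDFS.

But...

The output read from psql contains special characters, such as newlines inside quoted fields. If the data is loaded straight into HDFS through Hive, each embedded newline starts a new row, so a single field ends up split across several index entries.
Example: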

[input]
"CREATE External TABLE AAA_STG.OOO_TW_RM_BATCH_SETTING_INFO(
As_Of_Date                                         STRING
,Cip_X_Score                                        INT
,Cip_Y_Score                                        INT
)
STORED AS RCFILE"

[result]
index  field
1      CREATE External TABLE AAA_STG.OOO_TW_RM_BATCH_SETTING_INFO(
2      As_Of_Date                                         STRING
3      ,Cip_X_Score                                        INT
...

My first thought was to fix this with sed, but it turned out to be awkward for this job.

I then found that gawk can do it:

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' file

However, once the data gets large (tens of GB or more), this becomes very slow, so I implemented it directly in C:

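/* has_quote must start as false and keeps its value across lines,
   so a quoted field may span several fgets() reads. */
while (fgets(in_str, FGETS_SIZE, in_f) != NULL) {
    tmp = in_str;
    /* Walk through every double quote on this line. */
    while (1) {
        quote_index = strchr(tmp, '"');
        if (!quote_index)
            break;

        if (!has_quote) {
            /* Opening quote: a quoted field starts here. */
            has_quote = true;
        } else {
            /* Closing quote: clean the text between tmp and the quote. */
            has_quote = false;
            for (int i = 0; i < (quote_index - tmp); i++) {
                if (tmp[i] == '\n' || tmp[i] == '\r' ||
                    tmp[i] == '\t')
                    tmp[i] = ' ';
            }
        }
        tmp = quote_index + 1;
    }

    /* Still inside a quoted field at end of line: clean the remainder. */
    if (has_quote) {
        size_t len = strlen(tmp);  /* computed once, not per iteration */
        for (size_t i = 0; i < len; i++) {
            if (tmp[i] == '\n' || tmp[i] == '\r' || tmp[i] == '\t')
                tmp[i] = ' ';
        }
    }

    fprintf(out_f, "%s", in_str);
}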
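
The loop above is a fragment that relies on declarations not shown in the post. As a rough, self-contained sketch of the same idea, the quote tracking can be pulled into a helper that toggles a flag on every double quote and rewrites newline, carriage return, and tab only while the flag is set. The names `sanitize_chunk`, `filter_stream`, and `CHUNK_SIZE` are hypothetical and not from the original code. The sketch also assumes that every `"` in the input opens or closes a field, so backslash-escaped quotes are not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 65536 /* hypothetical buffer size, not from the original */

/* Replace '\n', '\r', and '\t' with a space, but only for characters that
 * lie between double quotes. *in_quote carries the open/closed state across
 * calls, so a quoted field may span several chunks. */
void sanitize_chunk(char *buf, bool *in_quote)
{
    assert(buf != NULL && in_quote != NULL);
    for (char *p = buf; *p != '\0'; p++) {
        if (*p == '"')
            *in_quote = !*in_quote;
        else if (*in_quote && (*p == '\n' || *p == '\r' || *p == '\t'))
            *p = ' ';
    }
}

/* Copy `in` to `out`, sanitizing quoted fields along the way. */
void filter_stream(FILE *in, FILE *out)
{
    static char buf[CHUNK_SIZE];
    bool in_quote = false;

    while (fgets(buf, sizeof buf, in) != NULL) {
        sanitize_chunk(buf, &in_quote);
        fputs(buf, out);
    }
}
```

Calling `filter_stream(stdin, stdout)` from a `main` function would turn this into a simple filter. It makes a single pass over each chunk instead of searching for quotes first and rewriting afterwards. Whether that is faster than the original loop has not been measured here.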

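With the C version compiled using nothing more than -O2, the performance gap was about 10x (roughly 200 seconds vs. 20 seconds). Hope this is a useful reference!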

ref:

https://stackoverflow.com/questions/29150640/how-to-remove-new-lines-within-double-quotes

Written by starzodiac
