Regexp - ClickHouse Documentation

Input	Output	Alias
✔	✗

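Description

The Regexp format parses each line of imported data according to the provided regular expression.

Usage

The regular expression from the format_regexp setting is applied to every line of imported data. The number of subpatterns in the regular expression must equal the number of columns in the imported dataset.

Lines of the imported data must be separated by the newline character '\n' or by a DOS-style newline "\r\n".

The content of each matched subpattern is parsed with the method of the corresponding data type, according to the format_regexp_escaping_rule setting.

If the regular expression does not match a line and format_regexp_skip_unmatched is set to 1, the line is silently skipped. Otherwise, an exception is thrown.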
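The per-line behavior described above can be sketched in Python. This is only an illustration, not ClickHouse code: the standard `re` module stands in for re2, the helper `parse_lines` and its parameters are hypothetical, and the sketch assumes that the pattern has to match the whole line.

```python
import re

def parse_lines(text, pattern, n_columns, skip_unmatched=False):
    """Return one list of captured strings per matching line (hypothetical helper)."""
    regex = re.compile(pattern)
    # The number of subpatterns (capture groups) must equal the number of columns.
    if regex.groups != n_columns:
        raise ValueError(f"pattern has {regex.groups} groups, expected {n_columns}")

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline

    rows = []
    for line in lines:
        if line.endswith("\r"):
            line = line[:-1]  # accept DOS-style "\r\n" line endings
        match = regex.fullmatch(line)  # assumption: the whole line must match
        if match is None:
            if skip_unmatched:
                continue  # mirrors format_regexp_skip_unmatched = 1
            raise ValueError(f"line does not match: {line!r}")
        rows.append(list(match.groups()))
    return rows

sample = "k=1 v=a\r\nbroken line\nk=2 v=b\n"
rows = parse_lines(sample, r"k=(\d+) v=(\w+)", n_columns=2, skip_unmatched=True)
print(rows)  # [['1', 'a'], ['2', 'b']]
```

With `skip_unmatched=False` the same input raises an error at `broken line`, which corresponds to the exception case. In ClickHouse the captured strings would then be converted to the column types according to format_regexp_escaping_rule; the sketch stops at the raw strings.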

Example usage

Consider the following file, data.tsv:

data.tsv

id: 1 array: [1,2,3] string: str1 date: 2020-01-01
id: 2 array: [1,2,3] string: str2 date: 2020-01-02
id: 3 array: [1,2,3] string: str3 date: 2020-01-03

And the table imp_regex_table:

Query

CREATE TABLE imp_regex_table (id UInt32, array Array(UInt32), string String, date Date) ENGINE = Memory;

Insert the data from the file above into the table using the following query:

Query

$ cat data.tsv | clickhouse-client --query "INSERT INTO imp_regex_table SETTINGS format_regexp='id: (.+?) array: (.+?) string: (.+?) date: (.+?)', format_regexp_escaping_rule='Escaped', format_regexp_skip_unmatched=0 FORMAT Regexp;"

You can now SELECT the data from the table to see how the Regexp format parsed the data from the file:

Query

SELECT * FROM imp_regex_table;

Response

┌─id─┬─array───┬─string─┬───────date─┐
│  1 │ [1,2,3] │ str1   │ 2020-01-01 │
│  2 │ [1,2,3] │ str2   │ 2020-01-02 │
│  3 │ [1,2,3] │ str3   │ 2020-01-03 │
└────┴─────────┴────────┴────────────┘
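The pattern in this example ends with a lazy group, `(.+?)`, yet the date column holds the complete date. That is the result one would expect if the pattern must match the entire line rather than only a prefix of it. The following Python sketch shows the difference; it uses the standard `re` module as a stand-in for re2, and whole-line matching is an assumption inferred from the result above, not something this page states.

```python
import re

pattern = re.compile(r"id: (.+?) array: (.+?) string: (.+?) date: (.+?)")
line = "id: 1 array: [1,2,3] string: str1 date: 2020-01-01"

# Whole-line match: the final lazy group has to stretch to the end of the line.
whole = pattern.fullmatch(line).groups()
print(whole)   # ('1', '[1,2,3]', 'str1', '2020-01-01')

# Prefix match: the final lazy group stops after a single character.
prefix = pattern.match(line).groups()
print(prefix)  # ('1', '[1,2,3]', 'str1', '2')
```

If a pattern is meant to tolerate trailing text, one option under the same assumption is to end it with an explicit `.*` outside any capture group.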

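Format settings

The following settings are available when working with the Regexp format.

format_regexp: String. Contains a regular expression in re2 syntax.
format_regexp_escaping_rule: String. Supports the following escaping rules.
- CSV (similar to CSV)
- JSON (similar to JSONEachRow)
- Escaped (similar to TSV)
- Quoted (similar to Values)
- Raw (extracts each subpattern as a whole with no escaping rules, similar to TSVRaw)
format_regexp_skip_unmatched: UInt8. Specifies whether to throw an exception when the format_regexp expression does not match the imported data. Can be set to 0 (throw an exception) or 1 (skip the unmatched line).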
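To give a feel for what an escaping rule changes, here is a rough Python approximation of how one captured substring might be read under three of the rules. It is a sketch built on the analogies listed above (CSV-like, JSON-like, raw), using Python's standard `csv` and `json` modules rather than ClickHouse's own parsers. The helper `decode_field` is hypothetical, and the Escaped and Quoted rules are left out.

```python
import csv
import io
import json

def decode_field(raw, rule):
    """Decode one captured substring under an approximated escaping rule."""
    if rule == "Raw":
        return raw  # taken as is, no unescaping
    if rule == "CSV":
        # CSV-like: optional double quotes, "" stands for one literal quote
        return next(csv.reader(io.StringIO(raw)))[0]
    if rule == "JSON":
        # JSON-like: a JSON value, e.g. a quoted string with backslash escapes
        return json.loads(raw)
    raise ValueError(f"rule not covered by this sketch: {rule}")

print(decode_field('"a ""b"""', "Raw"))     # "a ""b"""
print(decode_field('"a ""b"""', "CSV"))     # a "b"
print(decode_field('"a \\"b\\""', "JSON"))  # a "b"
```

The same captured text can therefore yield different column values depending on the rule, so the rule should match how the source data quotes and escapes its fields.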
