试用postgresql的pg_duckdb插件

试用postgresql的pg_duckdb插件 存储库地址https://github.com/duckdb/pg_duckdb/blob/main/README.md拉取pgduckdb镜像并运行容器登录容器aaakylin-pc:~$ sudo docker pull docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1 输入密码 18-v1.1.1: Pulling from pgduckdb/pgduckdb 3ac51941b358: Pull complete 8146abf1dff1: Pull complete 8fad3e9ec6e0: Pull complete b6ba526dc589: Pull complete 8a4a7306158c: Pull complete dbfb9c3db61f: Pull complete 3de8d92fb268: Pull complete 8dec8597fc9a: Pull complete ba2d457458a5: Pull complete 34c6fdfe850f: Pull complete ae8eb2cf7a2d: Pull complete ff8919aaa347: Pull complete f05c91f4b5ea: Pull complete a326b33c168a: Pull complete a05b0326ab0a: Pull complete 435f1ce1604d: Pull complete 6c5039352746: Pull complete ce9e9cc819e9: Download complete Digest: sha256:44c88eb9207971af1d7f9804a37429b1bb36f413cd8d7118c81e1288ddde85d1 Status: Downloaded newer image for docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1 docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1 aaakylin-pc:~$ sudo docker run -d -e POSTGRES_PASSWORDduckdb -v /home/aaa/par:/par --network host --name pgduckdb docker.1ms.run/pgduckdb/pgduckdb:18-v1.1.1 522c80d17fa1a863f14b99c6ac0d8c865a511b5498d9dd6613c04de4be4799be aaakylin-pc:~$ sudo docker exec -it pgduckdb bash postgreskylin-pc:/$ psql psql (18.1 (Debian 18.1-1.pgdg122)) Type help for help.2.测试1查询postgresql表postgres# -- This is a standard PostgreSQL tableCREATETABLEorders(order_idSERIALPRIMARYKEY,product_nameTEXT,amountNUMERIC,order_dateDATE);INSERTINTOorders(product_name,amount,order_date)VALUES(Laptop,1200.00,2024-07-01),(Keyboard,75.50,2024-07-01),(Mouse,25.00,2024-07-02);CREATETABLEINSERT03postgres# SET duckdb.force_execution true;SELECTorder_date,COUNT(*)ASnumber_of_orders,SUM(amount)AStotal_revenueFROMordersGROUPBYorder_dateORDERBYorder_date;SETWARNING:(PGDuckDB/CreatePlan)Prepared query returned an error: Binder Error:Nofunctionmatches the given nameandargumenttypessum(UnsupportedPostgresType(DuckDB requires the precision of a NUMERIC to be set. You can choose to convert these NUMERICs to a DOUBLE by using SETduckdb.convert_unsupported_numeric_to_doubletrue)).You might needtoaddexplicittypecasts.Candidate functions:sum(DECIMAL)-DECIMALsum(BOOLEAN)-HUGEINTsum(SMALLINT)-HUGEINTsum(INTEGER)-HUGEINTsum(BIGINT)-HUGEINTsum(HUGEINT)-HUGEINTsum(DOUBLE)-DOUBLEsum(BIGNUM)-BIGNUM LINE1:SELECTorder_date,count(*)ASnumber_of_orders,sum(amount)AStotal_revenueFROMpgduckdb.public.orders...^order_date|number_of_orders|total_revenue---------------------------------------------2024-07-01|2|1275.502024-07-02|1|25.00(2rows)警告类型绑定失败DuckDB需要设置NUMERIC的精度回退到postgresql引擎执行但可以用设置参数的方法解决。postgres#postgres# SET duckdb.convert_unsupported_numeric_to_double true;SETpostgres# explain SELECTorder_date,COUNT(*)ASnumber_of_orders,SUM(amount)AStotal_revenueFROMordersGROUPBYorder_dateORDERBYorder_date;QUERYPLAN------------------------------------------------------------Custom Scan(DuckDBScan)(cost0.00..0.00rows0width0)DuckDB ExecutionPlan: ┌───────────────────────────┐ │ ORDER_BY │ │ ──────────────────── │ │ orders.order_dateASC│ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ HASH_GROUP_BY │ │ ──────────────────── │ │ Groups:#0 ││ │ │ Aggregates: │ │ count_star()│ │sum(#1) ││ │ │~512rows│ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ PROJECTION │ │ ──────────────────── │ │ order_date │ │ amount │ │ │ │~810rows│ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ PGDUCKDB_POSTGRES_SCAN │ │ ──────────────────── │ │Table: orders │ │ │ │ Projections: │ │ order_date │ │ amount │ │ │ │~810rows│ └───────────────────────────┘(40rows)设置duckdb.convert_unsupported_numeric_to_double 以后警告消失执行计划也变成了DuckDB。pg_duckdb插件这么处理的原因是两种数据库numeric类型的范围不同。duckdb文档指出的范围--- https://duckdb.org/docs/current/sql/data_types/numeric#fixed-point-decimals Fixed-Point Decimals The data type DECIMAL(WIDTH, SCALE) (also available under the alias NUMERIC(WIDTH, SCALE)) represents an exact fixed-point decimal value. When creating a value of type DECIMAL, the WIDTH and SCALE can be specified to define which size of decimal values can be held in the field. The WIDTH field determines how many digits can be held, and the scale determines the number of digits after the decimal point. For example, the type DECIMAL(3, 2) can fit the value 1.23, but cannot fit the value 12.3 or the value 1.234. The default WIDTH and SCALE is DECIMAL(18, 3), if none are specified. Addition, subtraction and multiplication of two fixed-point decimals returns another fixed-point decimal with the required WIDTH and SCALE to contain the exact result, or throws an error if the required WIDTH would exceed the maximal supported WIDTH, which is currently 38. Division of fixed-point decimals does not typically produce numbers with finite decimal expansion. Therefore, DuckDB uses approximate floating-point arithmetic for all divisions that involve fixed-point decimals and accordingly returns floating-point data types. Internally, decimals are represented as integers depending on their specified WIDTH. Width Internal Size (bytes) 1-4 INT16 2 5-9 INT32 4 10-18 INT64 8 19-38 INT128 16 Performance can be impacted by using too large decimals when not required. In particular, decimal values with a width above 19 are slow, as arithmetic involving the INT128 type is much more expensive than operations involving the INT32 or INT64 types. It is therefore recommended to stick with a WIDTH of 18 or below, unless there is a good reason for why this is insufficient.postgresql文档指出的范围--- https://www.postgresql.org/docs/current/datatype-numeric.html Table 8.2. Numeric Types Name Storage Size Description Range smallint 2 bytes small-range integer -32768 to 32767 integer 4 bytes typical choice for integer -2147483648 to 2147483647 bigint 8 bytes large-range integer -9223372036854775808 to 9223372036854775807 decimal variable user-specified precision, exact up to 131072 digits before the decimal point; up to 16383 digits after the decimal point numeric variable user-specified precision, exact up to 131072 digits before the decimal point; up to 16383 digits after the decimal point ...postgresql的numeric最大能保存131072位而DuckDB是38位如果不指定精度postgresql的numeric类型数据在duckdb就可能溢出。删除旧表用带精度的DECIMAL重新建表把duckdb.convert_unsupported_numeric_to_double设置为false查询也没有警告了。postgres# SET duckdb.convert_unsupported_numeric_to_double false;SETpostgres# drop table orders;DROPTABLEpostgres# CREATE TABLE orders (order_idSERIALPRIMARYKEY,product_nameTEXT,amountDECIMAL(15,2),order_dateDATE);CREATETABLEpostgres# INSERT INTO orders (product_name, amount, order_date)VALUES(Laptop,1200.00,2024-07-01),(Keyboard,75.50,2024-07-01),(Mouse,25.00,2024-07-02);INSERT03postgres# SELECTorder_date,COUNT(*)ASnumber_of_orders,SUM(amount)AStotal_revenueFROMordersGROUPBYorder_dateORDERBYorder_date;order_date|number_of_orders|total_revenue---------------------------------------------2024-07-01|2|1275.502024-07-02|1|25.00(2rows)测试2duckdb独有的函数postgres# select * from range(1,3);ERROR:functionrange(integer,integer)doesnotexist LINE1:select*fromrange(1,3);^HINT:Nofunctionmatches the given nameandargumenttypes.You might needtoaddexplicittypecasts.postgres# \timing onTimingison.postgres# select sum(i) from generate_series(1,10000000)t(i);sum----------------50000005000000(1row)Time:33.870ms postgres# explain select sum(i) from generate_series(1,10000000)t(i);QUERYPLAN------------------------------------------------------------Custom Scan(DuckDBScan)(cost0.00..0.00rows0width0)DuckDB ExecutionPlan: ┌───────────────────────────┐ │ UNGROUPED_AGGREGATE │ │ ──────────────────── │ │ Aggregates:sum(#0) │└─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ PROJECTION │ │ ──────────────────── │ │ i │ │ │ │~10,000,000rows│ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ GENERATE_SERIES │ │ ──────────────────── │ │Function: │ │ GENERATE_SERIES │ │ │ │~10,000,000rows│ └───────────────────────────┘(25rows)DuckDB独有的range函数不能用。两者都有的generate_series函数可以用自动采用DuckDB引擎。postgres# select * from /par/tpch1/region.csv;ERROR: syntax error atornear/par/tpch1/region.csvLINE1:select*from/par/tpch1/region.csv;^postgres# select * from read_csv(/par/tpch1/region.csv);r_regionkey|r_name|r_comment-----------------------------------------------------------------------------------------------------------------------------------------------0|AFRICA|ar packages.regular excuses among the ironic requests cajole fluffily blithely final requests.furiously express p1|AMERICA|s are.furiously even pinto bea2|ASIA|c,special dependencies around3|EUROPE|e dolphins are furiously about the carefully4|MIDDLE EAST|foxes boost furiously along the carefully dogged tithes.slyly regular orbits accordingtothe special epit(5rows)read_csv虽然是duckdb独有的但已经在文档https://github.com/duckdb/pg_duckdb/blob/main/docs/functions.md 中列出就可以用但不明确写read_csv函数只写文件路径名的简写方式不支持。postgres# set duckdb.force_executionfalse;SETpostgres# explain select * from read_csv(/par/tpch1/region.csv);QUERYPLAN------------------------------------------------------------Custom Scan(DuckDBScan)(cost0.00..0.00rows0width0)DuckDB ExecutionPlan: ┌───────────────────────────┐ │ READ_CSV │ │ ──────────────────── │ │Function: READ_CSV │ │ │ │ Projections: │ │ r_regionkey │ │ r_name │ │ r_comment │ │ │ │~27rows│ └───────────────────────────┘(17rows)postgres# explain select sum(i) from generate_series(1,10000000)t(i);QUERYPLAN----------------------------------------------------------------------------------------Aggregate(cost125000.00..125000.01rows1width8)-FunctionScanongenerate_series t(cost0.00..100000.00rows10000000width4)JIT: Functions:5Options: Inliningfalse,Optimizationfalse,Expressionstrue,Deformingtrue(5rows)postgres# \timing onTimingison.postgres# select sum(i) from generate_series(1,10000000)t(i);sum----------------50000005000000(1row)Time:1479.601ms(00:01.480)明确设置duckdb.force_execution为false仅在文档中列出的read_csv走duckdb引擎generate_series走了postgresql。