I have some code that appends two pieces of the same data to a PyArrow table. The second write fails because the column is assigned a null type, and I understand why it does that. Is there a way to force it to use the type from the table schema rather than the type inferred from the data on the second write?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

data = {
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': [0, 1, 2, 1, 2]
}
df1 = pd.DataFrame(data)
df1['col3'] = 1

# Same data, but col3 is entirely NA, so PyArrow infers a null type for it
df2 = df1.copy()
df2['col3'] = pd.NA

pat1 = pa.Table.from_pandas(df1)
pat2 = pa.Table.from_pandas(df2)

writer = pq.ParquetWriter('junk.parquet', pat1.schema)
writer.write_table(pat1)
writer.write_table(pat2)  # fails: schema mismatch on col3
The error I get on the second write above:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1094, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
col1: string
col2: int64
col3: null
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 578
vs.
file:
col1: string
col2: int64
col3: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577