Allow for the overriding of stringify_dict for json export format on BaseSQLToGCSOperator #26277

Merged (1 commit) on Sep 18, 2022

patricker
Contributor

@patricker patricker commented Sep 9, 2022

closes: #26273

This change lets you dump dict objects returned from a database to JSON strings. Schema generation already labels these columns as strings (at least for Postgres).

Currently, JSON-type columns are hard to ingest into BigQuery: a JSON field in a source database does not enforce a schema, so we can't reliably generate a RECORD schema for the column.

The default behavior is unchanged; the feature must be enabled by setting stringify_dict=True.
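
For illustration only (this is not the code from the diff), a minimal sketch of what the flag changes in a newline-delimited JSON export. The `convert_type` and `row_to_json_line` helpers and the `columns` list are hypothetical stand-ins, not the operator's actual methods:

```python
import json

# Hypothetical column names for one row fetched from a source database.
columns = ["id", "payload"]


def convert_type(value, stringify_dict=False):
    """Return dicts as JSON strings when stringify_dict is set, else unchanged."""
    if stringify_dict and isinstance(value, dict):
        return json.dumps(value)
    return value


def row_to_json_line(row, stringify_dict=False):
    """Build one line of a newline-delimited JSON export from a DB row."""
    record = {
        name: convert_type(value, stringify_dict=stringify_dict)
        for name, value in zip(columns, row)
    }
    return json.dumps(record)


row = (1, {"plan": "pro"})

nested_line = row_to_json_line(row)                     # payload stays a nested object
flat_line = row_to_json_line(row, stringify_dict=True)  # payload becomes a string

print(nested_line)
print(flat_line)
```

With the flag off, `payload` is exported as a nested object whose shape can vary from row to row; with it on, `payload` is a plain string that can be loaded into a STRING column and parsed later.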

@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Sep 9, 2022
@Perados
Contributor

Perados commented Oct 3, 2022

Hey @patricker , any reason why this fix was not applied for csv and parquet files? Currently, there is no way to stringify rows for those formats.

@patricker
Contributor Author

Hey @Perados, I didn't do Parquet because I wasn't prepared to test it. The Parquet export format has a number of other issues and needs quite a bit of work.

As for CSV, I believe it already does this by default. I did some testing and in those tests dictionaries were being stringified by the unicodecsv library automatically.
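
A rough way to check this kind of behaviour, using the standard library `csv` module as a stand-in (an assumption: `unicodecsv` is a separate package and may not behave identically). The stdlib writer calls `str()` on non-string values, so a dict is written as its Python repr:

```python
import csv
import io

# Write a row containing a dict with the stdlib csv writer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([1, {"plan": "pro"}])

# Drop the trailing line terminator for display.
line = buffer.getvalue().rstrip("\r\n")
print(line)
```

Note that with the stdlib writer the cell holds the Python repr (single-quoted keys), which is a string but not valid JSON.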

@patricker patricker deleted the SQLToGCSAddDumpJSON branch October 3, 2022 14:37
@sleepy-tiger
Contributor

sleepy-tiger commented Oct 4, 2022

Hey @patricker, may I know why we hardcode stringify_dict=False in the convert_types function?

def convert_types(self, schema, col_type_dict, row, stringify_dict=False) -> list:
    """Convert values from DBAPI to output-friendly formats."""
    return [
        self.convert_type(value, col_type_dict.get(name), stringify_dict=stringify_dict)
        for name, value in zip(schema, row)
    ]

Because convert_types always passes stringify_dict explicitly, the stringify_dict=True default in PostgresToGCSOperator's implementation of convert_type never takes effect:

def convert_type(self, value, schema_type, stringify_dict=True):
    """
    Takes a value from Postgres, and converts it to a value that's safe for
    JSON/Google Cloud Storage/BigQuery.
    Timezone aware Datetime are converted to UTC seconds.
    Unaware Datetime, Date and Time are converted to ISO formatted strings.
    Decimals are converted to floats.

    :param value: Postgres column value.
    :param schema_type: BigQuery data type.
    :param stringify_dict: Specify whether to convert dict to string.
    """
    if isinstance(value, datetime.datetime):
        iso_format_value = value.isoformat()
        if value.tzinfo is None:
            return iso_format_value
        return pendulum.parse(iso_format_value).float_timestamp
    if isinstance(value, datetime.date):
        return value.isoformat()
    if isinstance(value, datetime.time):
        formatted_time = time.strptime(str(value), "%H:%M:%S")
        time_delta = datetime.timedelta(
            hours=formatted_time.tm_hour, minutes=formatted_time.tm_min, seconds=formatted_time.tm_sec
        )
        return str(time_delta)
    if stringify_dict and isinstance(value, dict):
        return json.dumps(value)
    if isinstance(value, Decimal):
        return float(value)
    return value

Hence the dict object won't be stringified.
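
A stripped-down reproduction of the behaviour, using hypothetical class names rather than the real operators, and assuming the export path goes through the base class's `convert_types`:

```python
import json


class BaseOperatorSketch:
    """Hypothetical stand-in for the base SQL-to-GCS operator."""

    def convert_type(self, value, schema_type, stringify_dict=False):
        return value

    def convert_types(self, schema, col_type_dict, row, stringify_dict=False) -> list:
        # stringify_dict is always forwarded explicitly, so the callee's
        # own default value is never consulted.
        return [
            self.convert_type(value, col_type_dict.get(name), stringify_dict=stringify_dict)
            for name, value in zip(schema, row)
        ]


class PostgresOperatorSketch(BaseOperatorSketch):
    """Hypothetical stand-in for the Postgres subclass with a True default."""

    def convert_type(self, value, schema_type, stringify_dict=True):
        if stringify_dict and isinstance(value, dict):
            return json.dumps(value)
        return value


op = PostgresOperatorSketch()
schema = ["id", "payload"]
row = (1, {"plan": "pro"})

via_convert_types = op.convert_types(schema, {}, row)  # dict is left untouched
direct_call = op.convert_type(row[1], None)            # subclass default applies

print(via_convert_types)
print(direct_call)
```

Only the direct call picks up the subclass default; going through `convert_types` without passing `stringify_dict=True` returns the dict as-is.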

@patricker
Contributor Author

@sleepy-tiger sorry, I don't know. It was already like that when I made my changes.

Please file an issue with the details of the problem you are having.

@eladkal
Contributor

eladkal commented Oct 6, 2022

Followup PR #26876

Linked issue: SQLToGCSOperators Add Support for Dumping JSON