langchain_community.utilities.spark_sql.SparkSQL

class langchain_community.utilities.spark_sql.SparkSQL(spark_session: Optional[SparkSession] = None, catalog: Optional[str] = None, schema: Optional[str] = None, ignore_tables: Optional[List[str]] = None, include_tables: Optional[List[str]] = None, sample_rows_in_table_info: int = 3)

SparkSQL is a utility class for interacting with Spark SQL.

Initialize a SparkSQL object.

Parameters
  • spark_session – A SparkSession object. If not provided, one will be created.

  • catalog – The catalog to use. If not provided, the default catalog will be used.

  • schema – The schema to use. If not provided, the default schema will be used.

  • ignore_tables – A list of tables to ignore. If not provided, all tables will be used.

  • include_tables – A list of tables to include. If not provided, all tables will be used.

  • sample_rows_in_table_info – The number of rows to include in the table info. Defaults to 3.
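As a rough usage sketch, the parameters above might be passed as shown here. The helper names `open_spark_sql` and `summarize` are hypothetical, and the local master setting and application name are assumptions for illustration. Imports are deferred so the snippet loads even where pyspark or langchain-community is not installed.

```python
from typing import List, Optional


def open_spark_sql(include_tables: Optional[List[str]] = None):
    """Build a SparkSQL wrapper around a local SparkSession (hypothetical helper)."""
    # Deferred imports: both packages must be installed before calling this.
    from pyspark.sql import SparkSession
    from langchain_community.utilities.spark_sql import SparkSQL

    # Assumption: a single-threaded local session is good enough for a demo.
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("sparksql-demo")
        .getOrCreate()
    )
    return SparkSQL(
        spark_session=spark,
        include_tables=include_tables,  # None means "use all tables"
        sample_rows_in_table_info=3,    # the documented default
    )


def summarize(db, query: str) -> str:
    """Collect table names, table info, and one query result as text.

    `db` is any object exposing the three documented methods used below.
    """
    names = ", ".join(db.get_usable_table_names())
    info = db.get_table_info_no_throw()
    result = db.run_no_throw(query)
    return f"tables: {names}\n{info}\nresult: {result}"
```

Calling `summarize(open_spark_sql(), "SELECT 1")` needs a working Spark installation; `summarize` itself only relies on the method names listed on this page.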

Methods

  • __init__([spark_session, catalog, schema, ...]) – Initialize a SparkSQL object.

  • from_uri(database_uri[, engine_args]) – Create a remote Spark session via Spark Connect.

  • get_table_info([table_names])

  • get_table_info_no_throw([table_names]) – Get information about specified tables.

  • get_usable_table_names() – Get names of tables available.

  • run(command[, fetch])

  • run_no_throw(command[, fetch]) – Execute a SQL command and return a string representing the results.

__init__(spark_session: Optional[SparkSession] = None, catalog: Optional[str] = None, schema: Optional[str] = None, ignore_tables: Optional[List[str]] = None, include_tables: Optional[List[str]] = None, sample_rows_in_table_info: int = 3)

Initialize a SparkSQL object.

Parameters
  • spark_session – A SparkSession object. If not provided, one will be created.

  • catalog – The catalog to use. If not provided, the default catalog will be used.

  • schema – The schema to use. If not provided, the default schema will be used.

  • ignore_tables – A list of tables to ignore. If not provided, all tables will be used.

  • include_tables – A list of tables to include. If not provided, all tables will be used.

  • sample_rows_in_table_info – The number of rows to include in the table info. Defaults to 3.

classmethod from_uri(database_uri: str, engine_args: Optional[dict] = None, **kwargs: Any) -> SparkSQL

Create a remote Spark session via Spark Connect. For example: SparkSQL.from_uri("sc://localhost:15002")

get_table_info(table_names: Optional[List[str]] = None) -> str
get_table_info_no_throw(table_names: Optional[List[str]] = None) -> str

Get information about specified tables.

Follows best practices as specified in: Rajkumar et al., 2022 (https://arxiv.org/abs/2204.00498)

If sample_rows_in_table_info is greater than zero, that many sample rows are appended to each table description. This can improve a language model's text-to-SQL performance, as demonstrated in the paper.
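To make the idea of appending sample rows concrete, here is a minimal sketch of one way a table description could be assembled. The function `format_table_info`, its parameters, and the exact comment layout are assumptions for illustration, not the library's guaranteed output format.

```python
from typing import List, Sequence


def format_table_info(
    create_statement: str,
    table: str,
    columns: Sequence[str],
    sample_rows: List[Sequence[object]],
) -> str:
    """Append sample rows, wrapped in a SQL comment, to a CREATE TABLE statement."""
    if not sample_rows:
        # No samples requested or available: the schema alone is the description.
        return create_statement
    header = "\t".join(columns)
    body = "\n".join("\t".join(str(value) for value in row) for row in sample_rows)
    return (
        f"{create_statement}\n\n"
        f"/*\n{len(sample_rows)} rows from {table} table:\n{header}\n{body}\n*/"
    )


info = format_table_info(
    "CREATE TABLE t (id INT, name STRING)",
    "t",
    ["id", "name"],
    [(1, "a"), (2, "b")],
)
print(info)
```

Keeping the rows inside a comment leaves the description valid SQL while still showing a model what the data looks like.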

get_usable_table_names() -> Iterable[str]

Get names of tables available.
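The interaction between include_tables and ignore_tables can be sketched in plain Python. This is an illustrative approximation, not the library's code: the function name `usable_table_names`, the rule that the two lists are mutually exclusive, the check for unknown names, and the sorted output are all assumptions of this sketch.

```python
from typing import Iterable, List, Optional


def usable_table_names(
    all_tables: Iterable[str],
    include_tables: Optional[List[str]] = None,
    ignore_tables: Optional[List[str]] = None,
) -> List[str]:
    """Return the table names left after applying an include or ignore list."""
    include = set(include_tables or [])
    ignore = set(ignore_tables or [])
    # Assumption: supplying both lists is treated as a configuration error.
    if include and ignore:
        raise ValueError("Specify include_tables or ignore_tables, not both")
    available = set(all_tables)
    # Assumption: names that do not exist are reported rather than silently dropped.
    missing = (include | ignore) - available
    if missing:
        raise ValueError(f"Unknown tables: {sorted(missing)}")
    # An include list wins outright; otherwise drop the ignored tables.
    return sorted(include) if include else sorted(available - ignore)


print(usable_table_names(["orders", "users", "logs"], ignore_tables=["logs"]))
```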

run(command: str, fetch: str = 'all') -> str
run_no_throw(command: str, fetch: str = 'all') -> str

Execute a SQL command and return a string representing the results.

If the statement returns rows, a string of the results is returned. If the statement returns no rows, an empty string is returned.

If the statement raises an error, run_no_throw returns the error message as a string instead of raising.
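The "no throw" behavior described above amounts to wrapping the raising call and converting any exception into text. A minimal sketch of that pattern follows. The names `run_no_throw_sketch` and `fake_run` are hypothetical, and the "Error: " prefix is an assumption about formatting.

```python
from typing import Callable


def run_no_throw_sketch(run: Callable[[str], str], command: str) -> str:
    """Call `run(command)` and return the error text instead of raising."""
    try:
        return run(command)
    except Exception as exc:
        # Assumption: the error is reported as a prefixed, human-readable string.
        return f"Error: {exc}"


def fake_run(command: str) -> str:
    """Stand-in for a SQL executor: accepts SELECT statements, rejects the rest."""
    if not command.strip().upper().startswith("SELECT"):
        raise ValueError(f"cannot parse: {command}")
    return "[(1,)]"


print(run_no_throw_sketch(fake_run, "SELECT 1"))
print(run_no_throw_sketch(fake_run, "SELEC 1"))
```

Returning the message as a string lets a caller, such as an agent loop, read the failure and retry with a corrected query rather than crash.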

Examples using SparkSQL