pyspark.SparkContext.textFile

SparkContext.textFile(name: str, minPartitions: Optional[int] = None, use_unicode: bool = True) → pyspark.rdd.RDD[str]

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. The text files must be encoded as UTF-8.
New in version 0.7.0.
- Parameters
  - name : str
    path to the input data files; it can be a single file, a directory, or a comma-separated list of paths
  - minPartitions : int, optional
    suggested minimum number of partitions for the resulting RDD
  - use_unicode : bool, default True
    If use_unicode is False, the strings are kept as str (UTF-8 encoded), which is faster and smaller than unicode; see the additional sketch after the examples below.
    New in version 1.2.0.
- Returns
  - RDD
    RDD representing the text data from the file(s).
Examples
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     path1 = os.path.join(d, "text1")
...     path2 = os.path.join(d, "text2")
...
...     # Write a temporary text file
...     sc.parallelize(["x", "y", "z"]).saveAsTextFile(path1)
...
...     # Write another temporary text file
...     sc.parallelize(["aa", "bb", "cc"]).saveAsTextFile(path2)
...
...     # Load text file
...     collected1 = sorted(sc.textFile(path1, 3).collect())
...     collected2 = sorted(sc.textFile(path2, 4).collect())
...
...     # Load two text files together
...     collected3 = sorted(sc.textFile('{},{}'.format(path1, path2), 5).collect())
>>> collected1
['x', 'y', 'z']
>>> collected2
['aa', 'bb', 'cc']
>>> collected3
['aa', 'bb', 'cc', 'x', 'y', 'z']
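The following sketch is not part of the official example set; it illustrates the use_unicode flag under the assumption that a SparkContext is available as sc, just as above. With use_unicode=False the elements come back UTF-8 encoded rather than decoded, so the snippet decodes them explicitly before comparing (the exact element type may differ across PySpark versions).

>>> # Illustrative sketch (assumed setup: a live SparkContext bound to `sc`)
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     path = os.path.join(d, "raw")
...
...     # Write two non-ASCII lines as a temporary text file
...     sc.parallelize(["é", "ü"]).saveAsTextFile(path)
...
...     # Read them back without decoding to unicode on the driver
...     raw = sorted(sc.textFile(path, use_unicode=False).collect())
>>> # Decode manually; with use_unicode=True this step would be unnecessary
>>> [x.decode("utf-8") if isinstance(x, bytes) else x for x in raw]
['é', 'ü']

Skipping the decode step in this way can save time and memory when the downstream job only counts, hashes, or filters lines, which is the trade-off the use_unicode parameter description refers to.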