PySpark – Convert CSV File to Nested Struct

Question:

I have a DataFrame loaded from a CSV file. The CSV contains data from a recommender system.

The columns are:

SKU, 
TYPE_SKU, 
TYPE_RECOMMENDED, 
SKU_RECOMMENDED_1, 
TYPE_SKU_RECOMMENDED_1,
SKU_RECOMMENDED_2, 
TYPE_SKU_RECOMMENDED_2,
SKU_RECOMMENDED_3, 
TYPE_SKU_RECOMMENDED_3,
SKU_RECOMMENDED_4, 
TYPE_SKU_RECOMMENDED_4

So I have 4 recommendation slots per row that I need to convert into the JSON/nested struct shown below, using PySpark.

{
     sku: 1,
     type_sku: 'Service',
     type_recommender: 'BUY TOGETHER',
     listOfRecommender: [
         {
          sku:123,
          type_sku: 'Merchandise'
         },
         {
          sku:124,
          type_sku: 'Merchandise'
         },
         {
          sku:4987,
          type_sku: 'Service'
         }
    ]
},
{
     sku: 2,
     type_sku: 'Merchandise',
     type_recommender: 'Another One',
     listOfRecommender: [
         {
          sku:123,
          type_sku: 'Merchandise'
         },
         {
          sku:124,
          type_sku: 'Merchandise'
         },
         {
          sku:4987,
          type_sku: 'Service'
         }
    ] 
}

Any help will be appreciated
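
A minimal sketch of one way to build this in PySpark (not from the original post): wrap each recommended SKU/TYPE pair in a struct, collect the four structs into an array column, and write the result as JSON. The file name file.csv, the ";" separator and the output path recommendations_json are assumptions; the column names follow the list above.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Assumed file name and separator
    df = spark.read.csv("file.csv", header=True, sep=";")

    # One struct per recommendation slot, gathered into an array column
    recommendations = F.array(*[
        F.struct(
            F.col("SKU_RECOMMENDED_%d" % i).alias("sku"),
            F.col("TYPE_SKU_RECOMMENDED_%d" % i).alias("type_sku"),
        )
        for i in range(1, 5)
    ])

    nested = df.select(
        F.col("SKU").alias("sku"),
        F.col("TYPE_SKU").alias("type_sku"),
        F.col("TYPE_RECOMMENDED").alias("type_recommender"),
        recommendations.alias("listOfRecommender"),
    )

    # Inspect a couple of rows as JSON strings
    print(nested.toJSON().take(2))

    # Write one JSON document per input row (assumed output path)
    nested.write.json("recommendations_json")

With this layout each input row becomes one JSON document with the sku, type_sku, type_recommender and listOfRecommender fields shown above; empty recommendation slots come through as null entries and can be filtered out afterwards if needed.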

Comments:

    
What have you done so far? If nothing, go through the documentation you can find and then come back.
    
    df = spark.read.csv("file.csv", header=True, mode="DROPMALFORMED", sep=";")

    # The groupby iteration below needs a pandas DataFrame, not a Spark one
    pdf = df.toPandas()

    def get_nested_rec(key, grp):
        rec = {}
        rec['SKU'] = key[0]
        rec['TYPE_SKU'] = key[1]
        rec['TYPE_RECOMMENDER'] = key[2]
        for field in ['SKUS']:
            income_types = list(grp[field].unique())
            rec['listRecommender'] = income_types
        return rec

    records = []
    for key, grp in pdf.groupby(['SKU', 'TYPE_SKU', 'SKUS']):
        rec = get_nested_rec(key, grp)
        records.append(rec)

    records = dict(data=records)
– Marcus Vinicius
2 hours ago
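
The snippet above reads a Spark DataFrame but then iterates a pandas-style groupby, so it only works after converting to pandas. Purely as an assumption, if the data were instead reshaped to one recommendation per row (hypothetical long-format columns SKU_RECOMMENDED and TYPE_SKU_RECOMMENDED in a file named long_format.csv), the Spark-native equivalent of that grouping would be groupBy plus collect_list of structs:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical long-format file: one row per recommendation, with columns
    # SKU, TYPE_SKU, TYPE_RECOMMENDED, SKU_RECOMMENDED, TYPE_SKU_RECOMMENDED
    long_df = spark.read.csv("long_format.csv", header=True, sep=";")

    nested = (
        long_df.groupBy("SKU", "TYPE_SKU", "TYPE_RECOMMENDED")
               .agg(
                   F.collect_list(
                       F.struct(
                           F.col("SKU_RECOMMENDED").alias("sku"),
                           F.col("TYPE_SKU_RECOMMENDED").alias("type_sku"),
                       )
                   ).alias("listOfRecommender")
               )
    )

This yields the same listOfRecommender array per SKU as the wide-column sketch earlier.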

Original source:

https://stackoverflow.com/questions/47753732/pyspark-convert-csv-file-to-nested-struct
