Recursive search with Python
Fri Jun 29, 2018

The Requirements

Recently I received from JSON like data that I needed to transform into a tabular dataset. As part of that there was a specific key that could occur as a child of different keys at different depths in the structure. Not only could the key I needed appear at different locations and depths, but when it was located it was possible that it would have N sibling occurrences I needed to retrieve at the same location. Finally for all of these there were a set of id and date keys at the top level of the structure that I was asked to include with each search key result.

I took a couple different paths on my way to solving this. One of the first things I found was the total depth was inconsistent across the structures. Not only that, but it wasn't uncommon to find the key scattered across depths up to 5 or 6 levels deep. The function below is what I ended up using. It's a recursive search that relies on the fact that the data is JSON like. Instead of trying to pull the parent keys out as part of the search I have a function that parses out the id and date keys passing those into this function as base. Then a search is performed on the input object checking the dictionary collections for all instances of the search key and when located appending the search keys value to the base data, which is then added to a list of results which is returned when the entire collection has been searched.

Gotchas

Problem Solved

The function below represents my final result. This worked well on the sample data, and eventually was used on PySpark RDDs to process hundreds of millions of structures quickly.

import copy
from future.utils import iteritems

def search(input, row_base, search_key, results):

    """
 A search function to help transform nested JSON like objects into tabular rows.

 The function takes a JSON like object as input along with a search key and returns a row
 for each occurrence of the key in the object.

 row_base is expected to be a list containing any base data you would like associated
 with the search_key data.
 """

    if input:
        for i in input:
            # If input contains a list run it through search again since it
            # may contain dictionaries with the key being searched 
            if isinstance(i, list):
                search(i, row_base, search_key, results)
            # If input contains a dictionary check if it contains the search_key
            # Also check if any of the values are list or dictionaries that need to be searched
            if isinstance(i, dict):
                for k, v in iteritems(i):
                    # If the search_key is located deepcopy row_base to prevent changing
                    # row_base on future hits. Create full row and append to results
                    if k == search_key:
                        row = copy.deepcopy(row_base)
                        row.append(i)
                        results.append(row)
                        continue
                    elif isinstance(v, list):
                        search(v, row_base, search_key, results)
                    elif isinstance(v, dict):
                        search(v, row_base, search_key, results)

    # Search has been exhausted return search results to caller.
    # Results will be a list of list.
    return results

Next Steps

Since this works there are a couple of ideas I want to explore with it.

Docs

For more information documentation on copy and future you can check out the documentation.

Contact

I'm sure others will have different ideas and approaches to something like this. Or you might have suggestions on something that could be done to make this faster or easier to read. If you have feedback or suggestion feel free to send them my way via up via email.


blog · about · sourcehut · hackaday · home