Flirting with Models

What does a full-stack quant research platform and process look like?

13 Feb 2023 · 20 min · Featuring: Chris Meredith


Episode Summary

In this episode of 'Flirting with Models,' host Corey Hoffstein revisits a conversation with Chris Meredith, co-chief investment officer at O'Shaughnessy Asset Management, focusing on the importance of research processes in finance. They discuss the evolution of data acquisition, the integration of new technologies, and the balance between raw and processed data. The episode emphasizes the significance of maintaining a robust research platform and the challenges of managing operational efficiency while incorporating innovative tools and methodologies.

Key Topics

Research Processes, Data Acquisition, Machine Learning, Quantitative Techniques, Operational Efficiency, Vendor Management, Research Graveyard, Investment Strategies

Full Transcript


Hey everyone, Corey here. Thanks for tuning in to another episode of Flirting with Models. If you're enjoying the show, I'd greatly appreciate it if you'd take a moment to rate, review, and, most importantly, share with a friend. Word of mouth is how this podcast grows. And if you'd like to learn more about Newfound's platform of Return Stacked mutual funds, ETFs, and model portfolios, head over to returnstacked.com. Now, on with the show.

What's up, everybody? Long-time listeners know that this podcast is a labor of pure love for me. So my only ask is that you help me spread the word about it: rate, review, and share this podcast. It'll take you just a few seconds or a quick tap of the thumb, but it would mean the world to me.

Hello and welcome, everyone. I'm Corey Hoffstein, and this is Flirting with Models, the podcast that pulls back the curtain to discover the human factor behind the quantitative strategy. Corey Hoffstein is the co-founder and chief investment officer of Newfound Research. Due to industry regulations, he will not discuss any of Newfound Research's funds on this podcast. All opinions expressed by podcast participants are solely their own opinion and do not reflect the opinion of Newfound Research. This podcast is for informational purposes only and should not be relied upon as a basis for investment decisions. Clients of Newfound Research may maintain positions in securities discussed in this podcast. For more information, visit thinknewfound.com.

In our industry, we are all too often guilty of asking "what is your alpha" rather than "what is your process for finding alpha", yet in the long run, it is the process that is important. I'm equally guilty of this: in the history of this podcast, I've probably overemphasized the outcome of research versus the process of research. There are a few exceptions, though. And in this dive into the archives, I want to return to season two, when I spoke with Chris Meredith, co-chief investment officer at O'Shaughnessy Asset Management. There are a lot of nuggets in this episode, ranging from ingesting data, to working with research partners, to a discussion of hardware setup. But the part that's always stuck with me the most was Chris's process for prioritizing research proposals based upon an AUM-scaled information ratio. I'll let Chris explain. Enjoy.

And I know you spend a lot of time not just in the weeds of the research itself, but also thinking about the research platform as a whole, enabling your team to do research in a more, not just thoughtful, but expeditious manner, making sure they have the tools at their fingertips. For you, what does a full-stack research platform look like today? And how has that picture evolved, say, going back to the earliest days of you just scraping Yahoo data?

First of all, it's better than scraping Yahoo data. I'll start with that.

Those were terrible models. But the idea is I separate a research platform into three parts. There's obviously the data: you're going to aggregate as much information as possible, build a mosaic of publicly available information around a company. There's the tools, like the computers, the code stacks, how you go about analyzing that data. So it's data plus analysis we're putting together. And then there's the people, the team itself, who's going about utilizing the tools and the data to reach conclusions. And on those three and how they've evolved over time, everything is heading towards more sophistication.

So for data, the breadth and depth of data available now is much, much higher than what we've seen in the past. So when you started back in 2004, or even back at Cornell, you were dealing essentially with just Compustat for the financial statements. You're dealing with CRSP for pricing, and then you're dealing with I/B/E/S possibly for analyst estimates along the way. And that's data that's going to go back to the 1960s on a quarterly basis for the financials and the 1920s for the pricing.

What we've seen, just because of what's happened with computing, the growth of it, Moore's law, doubling every 18 months, and what happened with storage costs going down, is this explosion of data. So it started coming out with originally some structured data sets, things like OptionMetrics, where you're looking at derivatives data. You're looking at TRACE starting up in 2000, where you're getting the corporate bond data that's coming through. All these things like ownership data now, where people are taking the 13F filings and structuring those so you can know which mutual funds are owning which company on a certain quarterly basis.

And building on those, they're building out professionals data to know who the executives are. What's the executive compensation? What's their tenure? What's the board? What's their makeup? And so then there's all this, call it, broadening of it, but there's also these pointy data sets that are coming out as well. So things like the ESG data, where you have Sustainalytics or the MSCI data, which is coming out and telling you what is the value-based or the risk-based portion of certain parts of the UN PRI, the 17-part structure they put out there for things that you should look at for sustainability and ESG awareness. You're seeing other ones that come through where they're talking about supply chain data.

So an interesting one, and there's a lot of vendor management on this for data: S&P comes through and they say they've got a new data set now, where they're looking at every shipping manifest coming into the country. And they're saying what the product is that's loaded into that cargo bin. Where is it headed? What's the address it's coming from and where is it going to?

So there are these interesting pointy data sets, like I said, ones that have a very specific application, and you have to weigh and judge what the cost-benefit of those is and how you're going to integrate them over time. So on top of that, on the data side, there's also unstructured data now.

And this is where it gets really interesting, which is this part about how we are incorporating more and more data that used to not be available for analysis. So the parts that I like to talk through are just things that were originally in books. A recent exercise we did was we scraped all the old Moody's manuals and we built financials for sales and earnings back to 1925, just to start seeing if there's anything interesting that comes from the Great Depression and analyzing value.

So that's one where you physically scrape the data and put it into raw form, just line by line, everything. You use a machine learning algorithm to classify which one's sales and which one's net income, and try to identify those. You use algorithms to look for data outliers, to see if somebody fat-fingered a number in there. And then you're going through manually, piecing that data together. So that's where you're building out a data set that's structured out of the unstructured.
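As an illustration of the outlier screen mentioned above, here is a minimal sketch. The transcript does not say which algorithm the team used; this one assumes a modified z-score based on the median and the median absolute deviation, and the function name, threshold, and sales figures are all hypothetical.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single mis-keyed number does not distort
    the yardstick it is measured against.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread to measure against, so nothing can be flagged.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical annual sales (in millions) with one likely keying error.
sales = [10.2, 10.8, 11.1, 11.9, 121.0, 12.6, 13.0]
suspect = flag_outliers(sales)  # indices to send for manual review
```

A flagged index is only a candidate for the manual check described above, not a confirmed error. On a series that trends over decades, running the same screen on year-over-year changes rather than levels may work better.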

Another one, though, would be reading: quite plainly, natural language processing. So an interesting exercise we did was we took all the 10-Ks that were out there and we ran them through this piece of technology called Doc2Vec. And Doc2Vec essentially vectorizes word space. It uses the probability loadings of the words and how close they are next to each other, and then turns that into basically a principal component analysis for a document. So for all the 10-Ks that are out there, and all the management discussion and analysis sections, it turns each into a loading of numbers. You lose the interpretability. You don't know what the numbers apply to, but you can group them and say these are similar on this one, these are similar on another vector that's out there, and then use that to analyze it.
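To make the "turn each document into numbers, then group the similar ones" idea concrete, here is a small stand-in sketch. It is not Doc2Vec: Doc2Vec learns dense vectors with a neural model, while this sketch uses plain TF-IDF word weights and cosine similarity from the Python standard library. The three text snippets are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented snippets standing in for MD&A sections.
docs = [
    "supply chain costs rose on higher freight rates",
    "freight rates and supply chain costs pressured margins",
    "loan loss provisions increased as credit quality weakened",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two supply-chain discussions
sim_02 = cosine(vecs[0], vecs[2])  # supply chain versus credit losses
```

The first two snippets share vocabulary and score as similar, while the third shares none and scores zero. Unlike this stand-in, learned embeddings can also place documents close together when they use different words for the same idea.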

What's really interesting, by the way: I believe this is the technology that's behind Google Translate. So they did it on Wikipedia, and they did it for English and they did it for French. It turned out the loadings were the same, so they could just map those along the way. But what's interesting for me is that, when I started, and you asked about change, back in 2005, there was this barrier between what a fundamental analyst could do versus what a quantitative analyst could do. And a lot of that was just brute-force information consumption, the idea of listening to the calls, reading the analyst reports. That's breaking down now. That's one where we're able to incorporate that into our process.

And it's going to be really interesting to see where this shakes out over the next, call it, five to ten years on what you're able to incorporate. Because it's not just being able to read the MD&A section. We're able to read 3,000 of them in a minute, which is what you can't do on the fundamental side. So it's that question of where's the edge coming from on the fundamental versus quantitative processes?

Talk to me a little bit more about these pointy data sets, sort of the depth aspect. You mentioned things like ESG and shipping manifests. There are a large number of vendors now, and some of these sort of fall into the alternative space, but it strikes me that you have a balance there between this idea of raw data and somewhat pre-processed data as you look to acquire those data sets. What's your preference? Do you have a preference as to whether you're just getting everything raw and your team has the ability to build it up and structure it as you want, or do you want a vendor that's cleaning and going through and giving it to you, hopefully in a more valuable format?

So there are value-add propositions from vendors. On that shipping one, the manifest data is publicly available. What they feel their value-add is, is they've done the mapping of the physical name and address to a company. So there's a benefit to that, and this is where, from my seat, I talked about vendor management. There's always this balance: we're a boutique. We have a budget. We've got to work out how to spend the money. And there's this idea of, okay, they can bring this to us, but we could also scrape the manifests. So what's your value-add on top of this? And that's where, oh, that mapping exercise, that's fairly difficult.

Same thing with something like a news analytics source, where there are providers out there just scraping the news. So scraping raw news, it's not a trivial exercise, but it's not that hard. There's some HTML processing and the rest. But that idea of mapping an article to an entity is something. So there's an article talking about General Electric and Honeywell. Which is the main subject of that article? So you've got to do an exercise of building that metadata layer, and then there's an analytics part of it, of saying, okay, what's the sentiment of this, etc.?

So the way I look through this is there's a data processing aspect, which by the way is the least fun part of our job. It's like I talked about, the manual mapping of that S&P data, and going through and just checking it and being like, okay, that number looks weird. That is the not-fun, don't-talk-to-your-kids-about-it-over-the-dinner-table part of it. But there's the analytics part of it, which you don't want to outsource. So if they wrap everything in and say, by the way, here's the sentiment of that news article, that's where we say, you know, I would like to build my own sentiment. I would like to understand all the pieces of it. And, again, on this platform and how it's changed, that's the change: more granularity on our side, more ability to build and do our own analytics in building these data points.

So it used to be one where you would work within a FactSet framework and you would just take their earnings yield number. And now we build our own earnings yield number from the pieces, understanding all the parts of it. It's the same thing for these other data sets, just having it where, again, you want more and more flexibility on how you're putting these things together.

So you mentioned some of these pointy data sets, and you've mentioned a couple of tools which I would consider to be more on the forefront of data exploration, things more on the machine learning side of the equation. But you also mentioned the concept of budget. You don't have an infinite amount of time. You don't have an infinite amount of money. How do you think about the problem of trying to integrate these new ideas and these new frontiers of both data and quantitative techniques when you don't necessarily know which will bear fruit for you?

We have a huge graveyard of ideas. It's one where we have tested a number of things in research. It's a low-percentage shot. You're going in with ideas. It's rare to find something that is truly great. There's a lot of the research we do that's incremental, so it's about trying to just rework problems you already have. When you're going about trying a new idea, the most important thing we start with is just, is there an idea behind it? Is there something that intuitively makes sense? Is it something that should have the ability to forecast earnings or to be able to predict what's going on inside of a company?

And again, starting with the idea is the most important part. Some of this is where you would come up with ideas: academic literature, doing surveys, and just trying to figure out what other people are doing. That can give you some idea of whether there's some meat on the bones for some of these ideas. My concern is academic papers. I'm not sure if they're sharing as much as they used to. But you still do the surveys of literature to try to figure out what's out there.

But there's also this part, a formula that we put in place in the research team, which is trying to understand what the value-add potentially is. The way that we've done it is, in every research template that somebody comes up with as a proposal, there's a modified information ratio, which is essentially: what's your expected return out of this, or is it an expected decrease in transaction costs, scaled by the risk? Are you going to be increasing returns? Are you going to be lowering transaction costs? Or are you going to be overall lowering the risk of the portfolio versus the benchmark? And then multiply that by the assets that you think you're going to impact for our existing client base. Or if it's a new product, where do you think this fits into our overall business?

So there's that modified, call it, view of just trying to quantify what the potential impact of one of these is, and then there's obviously the cost aspect. Which is, if somebody's coming to us and saying, I'll license you that news data source, and by the way, it's a quarter of a million dollars a year. And we expect that, on the other side, it's going to potentially keep us from buying, and it's going to have like 20 basis points of impact on our trading model. We're like, I'm not sure that's the right trade to make.
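The prioritization described above can be sketched roughly as follows. The score is the expected benefit (added return or saved transaction cost) divided by expected risk, then multiplied by the assets affected. Everything beyond that is an assumption for illustration: the class and field names, the net-dollar comparison against the licence cost, the second proposal, and the asset and risk figures. Only the quarter-million-dollar cost and the 20 basis points come from the conversation.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    expected_benefit: float    # annual, as a decimal (0.0020 = 20 bps)
    expected_risk: float       # expected active risk, as a decimal
    impacted_assets: float     # dollars of client assets affected
    annual_cost: float = 0.0   # data licence or other running cost, dollars

    def scaled_score(self) -> float:
        """Benefit per unit of risk, scaled by the assets it touches."""
        return self.expected_benefit / self.expected_risk * self.impacted_assets

    def net_dollar_benefit(self) -> float:
        """Expected annual dollar benefit after running costs (an assumed check)."""
        return self.expected_benefit * self.impacted_assets - self.annual_cost


proposals = [
    # 20 bps of impact for $250k a year; asset and risk figures are hypothetical.
    Proposal("news data feed", 0.0020, 0.04, 100e6, annual_cost=250_000),
    # A purely hypothetical second proposal for comparison.
    Proposal("execution tweak", 0.0010, 0.01, 500e6),
]

ranked = sorted(proposals, key=lambda p: p.scaled_score(), reverse=True)
for p in ranked:
    print(f"{p.name}: score={p.scaled_score():,.0f}, "
          f"net=${p.net_dollar_benefit():,.0f}")
```

With these made-up inputs, the news feed costs more per year than the dollar benefit it is expected to produce, which is the kind of trade the speaker says he would hesitate to make.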

So one of the things about having such a strong research platform is it obviously leads to a lot of different research ideas, a lot of research projects. But as anyone who's lived in the space long enough knows, the vast majority of those projects end up going nowhere. This idea of a research graveyard that you mentioned, I love that concept, and I know I've got a very expansive research graveyard. But mine just ends up in sort of this labyrinth of folders, this directory structure of failed projects.

And I was listening to Patrick on Ted Seides's podcast the other day, and he mentioned something really interesting, which was that when you guys have a research project, even if there is a failure, even if you don't believe the research goes anywhere, you don't actually just drop it. You institutionalize the knowledge in your process, and you make sure that the process is set up to continue tracking that research going forward, to make sure that that time series, whatever is being evaluated, continues, so that someone can return to it and see how things have worked out of sample. Can you speak to that a little bit? Because I thought it was a really fascinating extra step that I don't think a lot of firms go through. But I could see how incredibly valuable it would be to put in that extra legwork.

Yeah, it's one where, again, the investment is made. I think through technologies from my old days in IT, which is that it's an asset that you build and then depreciate.

So the idea is that you keep track of these things you put together. So, like the ownership data. The first thing that we were trying to look at was whether owner-operators, so somebody that owns 20 percent of the company or more, have the ability to generate excess returns for their company. And in our research, it's a sound idea. And it was one where we had a client who was asking us to take a look at this, so we were exploring it. We turned the data over, and it was one of those frustrating ones where you felt like there was maybe 80 basis points of alpha coming out of this. And then you flip it another way and it was a little bit flatter, and the rest. At some point you're like, okay, this doesn't feel like it's going to generate something significant for us. There were still some interesting things that came out of it, things like private equity board membership and how that looks within companies. There are still some ideas that we're exploring along the way, but we put a pin in it. That said, we still have it where we are tracking that owner-operator metric.

It's in our graveyard, and it's one where we're saying, let's see if there's anything that potentially comes around. Because, say, in our initial analysis, you're going through and looking on a linear basis, saying, does this one factor explain anything? No. But in my opinion, when you incorporate more non-linear methods and look at factor interactions, there are potential ways that a factor may not do something by itself, but it may be a conditional inside of something else. So there's a part about keeping track of it, letting you still examine those factors, keeping it as part of the graveyard, as I said. And even though you don't put it inside the model, you still calculate it and have it every time you do the rebuild on your factor database.
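One way to picture "still calculate it on every rebuild, but keep it out of the model" is a factor registry with a status flag. This is a hedged sketch, not OSAM's actual system: the class, function names, tickers, and company figures are hypothetical. The 20 percent ownership cut-off is the one mentioned in the conversation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Factor:
    name: str
    compute: Callable[[dict], float]
    status: str = "live"  # "live" feeds the model; "graveyard" is only tracked


def rebuild(factors, companies):
    """Recompute every factor, graveyard included, for every company."""
    return {
        ticker: {f.name: f.compute(data) for f in factors}
        for ticker, data in companies.items()
    }


def model_inputs(factors, row):
    """Keep only the live factors from one company's row."""
    return {f.name: row[f.name] for f in factors if f.status == "live"}


factors = [
    Factor("earnings_yield", lambda d: d["eps"] / d["price"]),
    # Shelved idea: flag owner-operators holding 20 percent or more.
    Factor(
        "owner_operator",
        lambda d: 1.0 if d["insider_ownership"] >= 0.20 else 0.0,
        status="graveyard",
    ),
]

# Made-up company data for illustration.
companies = {
    "AAA": {"eps": 5.0, "price": 50.0, "insider_ownership": 0.25},
    "BBB": {"eps": 2.0, "price": 40.0, "insider_ownership": 0.03},
}

database = rebuild(factors, companies)
inputs = model_inputs(factors, database["AAA"])
```

Because the graveyard factor keeps accumulating history in the database, someone can later test it out of sample, or as a condition inside a non-linear model, without reconstructing the series.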

I want to return to the tooling side of things, of the three aspects you laid out: the data, the tools, the people. Well, maybe both the data and the tools. As you continue to expand the data in the platform, whether that's incorporating new data sets or data sets that you guys have internally created and curated and are now tracking, as well as incorporate new tools, new ideas like machine learning, how do you think about managing the actual operational burden of maintaining platform flexibility? There's this idea that you have a platform that worked four years ago, you start incorporating all these tools, and all of a sudden it's this unwieldy, duct-taped-together machine. How do you think about keeping it well oiled so that the team can keep producing at high efficiency?

First of all, it's a great point, which is that there's a way that things get clunky over time. But I think it's four times at OSAM that we have rebuilt the platform from scratch, where you learn from what is there and you incorporate all the best ideas and what you've learned from the data over time. And we went through the last one back in like 2017, where we refactored all the code and basically put it back together. Part of this is that we have a, call it, gold-standard research area that we maintain, and then we spin it out. So everybody has one on their laptop.

We bought the heaviest laptops you could possibly imagine for people on the research team. They're like 12-core, 64-gig-RAM laptops that apparently are 20 volt. We didn't know this the first time, so you can't plug them in on a plane. Danny, one of our research analysts, learned that on his way to Thailand, or over to the Pacific, I forget where he was going, but he had an 18-hour flight and his laptop died in 40 minutes. He was like, "I can't do anything on this anymore."

But we have it where everybody kind of has their own, and they're literally called sandboxes. And the idea is you can mess up your sandbox all you want, and what will happen is, when you need it, you get a refresh from that gold standard. So part of it is having a research team that is able to work quickly and rapidly. We believe in an entrepreneurial environment where you've got to work and move quickly, and this is why we're setting them up with their own environment, their own tools.
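The "mess it up, then refresh from the gold standard" workflow could look something like the sketch below. It assumes the gold standard and each sandbox are plain directories, which the conversation does not specify; the function name, paths, and file contents are hypothetical, and the demo runs in a throwaway temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def refresh_sandbox(gold: Path, sandbox: Path) -> None:
    """Throw away the sandbox and replace it with a fresh copy of gold."""
    if sandbox.exists():
        shutil.rmtree(sandbox)
    shutil.copytree(gold, sandbox)


# Demo in a temporary directory so nothing real is touched.
root = Path(tempfile.mkdtemp())
gold = root / "gold_standard"
sandbox = root / "sandbox_analyst"

gold.mkdir()
(gold / "factors.csv").write_text("ticker,earnings_yield\nAAA,0.10\n")

refresh_sandbox(gold, sandbox)                       # initial copy
(sandbox / "factors.csv").write_text("oops")         # analyst breaks the data
(sandbox / "scratch.py").write_text("print('hi')")   # and adds experiments
refresh_sandbox(gold, sandbox)                       # back to a clean state
```

The copy only ever flows from the gold standard to the sandbox. Anything worth keeping from an experiment has to be promoted separately, which matches the hand-off to production described in the conversation.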

That's one aspect: letting them run with those, build something out, and prove it. Then there's a part about translating that into production. So a lot of my job, I feel like, is taking the work that comes from the research end of the desk and helping that get over to the tech end of the desk, where they're running that in production on a daily basis.

So a lot of that is, once they get it done: okay, how do we integrate it with this? What's the code base we're going to use? Version control, tapping it into all of our source control. That's where best practices suddenly come into place. So there's kind of this: work in your sandbox, get the results, be as messy as you want to, and we're going to wipe that out and start over again.

And by the way, that sandbox idea is why the research part was so clean. We were able to take those sandboxes and just port them out for people, and say, here's a copy of this, go ahead and work with our derived data. You're not working with the raw data, you're working with all the pieces of how we've cleaned it and put it together. And so everybody has their own environment that they're working in.

Then also, you're not stepping on each other. We learned that probably five years ago: when you have one big server and everybody's working off of it, particularly with the latest CPU-intensive algorithms, you can wind up having to share time on it, and people got tired of waiting for me. So we shipped it out so that everybody now has their own environment. That's kind of the long-winded way of saying how we go about managing these environments to try to, as you say, keep things as clean as possible while allowing people the flexibility to work and create. When you're pulling apart a new data set or you're trying something new, there's going to be all these little widgets that you're building on the side, etc. So then we just pick those and figure out how to translate them back into production.

I hope you enjoyed this dive into the archives. If you did, leave us a rating or review and share with a friend. It helps us grow, and it means the world. Thanks for listening.

