How is import_id's occurrence determined?

Hello there!

I'm currently working on a small script that will synchronize my bank transactions to YNAB through the API, and I'd like to have more details on import_id, especially the occurrence part.

I understand that import_id is meant to be a unique id that prevents YNAB from importing the same transaction multiple times.

Its format is "YNAB:[milliunit_amount]:[iso_date]:[occurrence]".

As far as I understand the documentation, occurrence is a sequential index that starts at 1 for a given amount:date pair and is incremented for every subsequent transaction sharing that same pair.
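For example, here is how I build one in my script (a minimal sketch; the helper name is mine, and amounts are in milliunits, so a $30.00 outflow is -30000):

def make_import_id(milliunit_amount, iso_date, occurrence):
    # make_import_id(-30000, "2018-09-26", 1) -> "YNAB:-30000:2018-09-26:1"
    return "YNAB:%d:%s:%d" % (milliunit_amount, iso_date, occurrence)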

The way I understand it, YNAB keeps a list of all imported transactions' import_ids, and will refuse to reimport any transaction bearing the same import_id again (via any channel: file import, bank syncing, or the API).

What I can't guess without the help of YNAB engineers is how occurrence gets set on each transaction when a conflict arises (i.e. when there's more than one transaction for a specific amount:date pair).

Here is a test I ran to understand this a bit better. I imported the following two CSVs, in this order:

First CSV:

Date,Payee,Memo,Outflow,Inflow
9/26/2018,,transfer to Peter,30.00,
9/26/2018,,cash withdrawal,30.00,
9/25/2018,,phone bill,20.00,

Second CSV:

Date,Payee,Memo,Outflow,Inflow
9/26/2018,,new transaction,30.00,
9/26/2018,,transfer to Peter,30.00,
9/26/2018,,cash withdrawal,30.00,
9/25/2018,,phone bill,20.00,

I ended up with a duplicate cash withdrawal, whereas my new transaction was not imported.
From this result, I deduce that file import basically parses the file from top to bottom and increments the occurrence id only when necessary. Is that correct?
If so, I would also deduce that it is preferable for imported files to be sorted from oldest to newest, and for new transactions to appear at the bottom.
Otherwise a given transaction could accidentally get a different occurrence index over time.

This also means that, when I fetch a list of transactions from my bank or credit card company, I need to assume that the order will be consistent over time, and that if a new transaction shows up for a given day, it will appear at the bottom and not disturb the other transactions' occurrence ids.

Do you have any better advice? Are you doing anything smarter with bank syncing? Is there a way for me to make sure that bank syncing, file import and the API will behave well together, and that deduplication will be as accurate as it can be?
 

3 replies
  • David Stosik - Your guesses are quite good here :)

    For each batch of transactions that YNAB imports from a file or directly from a bank, we first sort the transaction list by date (descending), then amount (descending). Then, we assign an import_id to the transactions sequentially. If we encounter a transaction whose date/amount pair has already been seen in the same batch, we increment the occurrence number. Once we've done this, we import those transactions with the assigned import_ids.
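    In rough pseudocode, the idea is the following (a Python sketch of the description above, not our actual implementation, and the field names are illustrative):

    def assign_import_ids(batch):
        # Sort by date (descending), then amount (descending).
        ordered = sorted(batch, key=lambda t: (t["date"], t["amount"]), reverse=True)
        seen = {}
        for t in ordered:
            key = (t["date"], t["amount"])
            seen[key] = seen.get(key, 0) + 1  # occurrence starts at 1 per date/amount pair
            t["import_id"] = "YNAB:%d:%s:%d" % (t["amount"], t["date"], seen[key])
        return ordered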

    When importing transactions directly from a bank, we have the advantage that each transaction carries a unique ID from the bank, so we are able to correlate those with previously imported transactions. That is separate from the import_id assignment, but it supports our ability to detect new transactions. With file-based import, we usually do not have this unique ID, so we rely on sorting the transactions to ensure we assign import_ids consistently. When importing through the API, it is of course up to you how to handle this, but for consistency with the other methods, using a consistent sorting of the transactions should be enough for most scenarios, I think. Even including another field like Memo in the sort key would help ensure the sorting is stable, I think.
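    For example, the sort key in the sketch above would just gain the memo as a tiebreaker:

    ordered = sorted(batch, key=lambda t: (t["date"], t["amount"], t["memo"]), reverse=True)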

    I hope this helps!  If you come up with something that might be helpful for others please post back here and let us know.

  • Brady - Thanks for the detailed reply; this is helpful! Unfortunately, I think your advice does not work:

    Even including another field like Memo in the sort key would help ensure the sorting is stable, I think.

    Let's reuse the lists of transactions from my CSV example.

    List 1:

    Date,Payee,Memo,Outflow,Inflow
    9/26/2018,,transfer to Peter,30.00,
    9/26/2018,,cash withdrawal,30.00,
    9/25/2018,,phone bill,20.00,

    List 2:

    Date,Payee,Memo,Outflow,Inflow
    9/26/2018,,new transaction,30.00,
    9/26/2018,,transfer to Peter,30.00,
    9/26/2018,,cash withdrawal,30.00,
    9/25/2018,,phone bill,20.00,

    Say I have an import script that runs twice a day, at noon and midnight. At noon, it gets the first list of transactions, and at midnight, it gets the second one. The transaction named "new transaction" was added to that list (maybe I made that payment around 2pm). Now let's assume I sort transactions by date, amount and memo before assigning them an occurrence number. The two lists will end up as follows:

    List 1:

    Date,Amount,Memo,occurrence
    9/25/2018,-20.00,phone bill,1
    9/26/2018,-30.00,cash withdrawal,1
    9/26/2018,-30.00,transfer to Peter,2
    

    List 2:

    Date,Amount,Memo,occurrence
    9/25/2018,-20.00,phone bill,1
    9/26/2018,-30.00,cash withdrawal,1
    9/26/2018,-30.00,new transaction,2
    9/26/2018,-30.00,transfer to Peter,3
    

    As you can see, the occurrence (and consequently the import_id) is not stable over time: the value for the transaction titled "transfer to Peter" changed from 2 to 3. With this logic, the second import will miss "new transaction" and duplicate "transfer to Peter".
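    To make this concrete, here is a quick Python check (amounts in milliunits, sorting by date, amount and memo as in the tables above; the layout of the rows is my own):

    def import_ids(rows):
        ids, seen = {}, {}
        for date, amount, memo in sorted(rows):
            key = (date, amount)
            seen[key] = seen.get(key, 0) + 1
            ids[memo] = "YNAB:%d:%s:%d" % (amount, date, seen[key])
        return ids

    list1 = [("2018-09-25", -20000, "phone bill"),
             ("2018-09-26", -30000, "cash withdrawal"),
             ("2018-09-26", -30000, "transfer to Peter")]
    list2 = list1 + [("2018-09-26", -30000, "new transaction")]

    print(import_ids(list1)["transfer to Peter"])  # YNAB:-30000:2018-09-26:2
    print(import_ids(list2)["transfer to Peter"])  # YNAB:-30000:2018-09-26:3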

    When importing transactions directly from a bank, we have the advantage that each transaction carries a unique ID from the bank, so we are able to correlate those with previously imported transactions.

    Do you mean that with bank import, you use something in addition to import_id for deduplication? Here are some ways I could see that helping:

    • use the transaction id provided by the bank as occurrence, instead of an incremental number
    • use the transaction id exclusively as the whole import_id
    • use an incremental number as occurrence, but keep a store matching each transaction's bank-provided id to the import_id it was assigned the first time it was imported, to ensure that occurrence is stable over time (i.e. a given transaction will always be assigned the same occurrence value); see the sketch after this list.
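    For that third option, I am picturing something like this (a hypothetical sketch; the store format and names are mine):

    import json

    STORE = "import_id_store.json"  # maps the bank's transaction id -> assigned import_id

    def load_store():
        try:
            with open(STORE) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save_store(store):
        with open(STORE, "w") as f:
            json.dump(store, f)

    def stable_import_id(txn, store):
        # txn is e.g. {"bank_id": "abc123", "date": "2018-09-26", "amount": -30000}
        if txn["bank_id"] in store:
            return store[txn["bank_id"]]  # reuse the import_id assigned on first sight
        prefix = "YNAB:%d:%s:" % (txn["amount"], txn["date"])
        occurrence = sum(1 for v in store.values() if v.startswith(prefix)) + 1
        import_id = prefix + str(occurrence)
        store[txn["bank_id"]] = import_id
        return import_id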

    I know the situation is unlikely to happen often, but if I want to implement a perfect import, will I need to implement my own deduplication logic, which first checks which transactions already exist before deciding which ones to import?

    I'm still trying to wrap my head around some of the details in your answer, so here are some follow-up questions:

    • I think I now understand that although the import_id/occurrence logic is meant to prevent duplicate imports, it will do a poor job over time, as it can never be sure it is assigning the same import_id to a given transaction across batches. (See my CSV experiment in the top post.) Is that correct?
    • Is there a chance I can cajole you into explaining your bank-import deduplication logic to us in enough detail that we'll be able to replicate it when using the API?
    • It seems that, for bank import, you use another level of deduplication, based on a unique transaction id provided by the bank. Any chance this could be exposed in the API? (If I know the transaction's unique id, I'd be happy to provide it in my API requests.)
    • Finally, do you think that following YNAB's import_id format is worth it for me? Wouldn't I benefit more from having my own import_id format, which could include a bank transaction's unique id, or a hash of the whole transaction (including the memo), and would be stable over time? (See the sketch below.)
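    For the hash idea in the last bullet, I imagine something like this (hypothetical; if I remember the docs correctly, import_id is capped at 36 characters, hence the truncated digest):

    import hashlib

    def hashed_import_id(txn):
        # Derive a stable id from the transaction's own data, memo included,
        # instead of a positional occurrence counter.
        raw = "%s:%d:%s" % (txn["date"], txn["amount"], txn["memo"])
        digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
        return "MYSYNC:" + digest

    # Caveat: two genuinely identical transactions (same date, amount and memo)
    # would still collide, so I might need a counter on top for that edge case.
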
  • David Stosik -

    I think I now understand that although the import_id/occurrence logic is meant to prevent duplicate imports, it will do a poor job over time, as it can never be sure it is assigning the same import_id to a given transaction across batches. (See my CSV experiment in the top post.) Is that correct?

    That is correct.

    It seems that, for bank import, you use another level of deduplication, based on a unique transaction id provided by the bank. Any chance this could be exposed in the API? (If I know the transaction's unique id, I'd be happy to provide it in my API requests.)

    I will look into getting that added, but the unique ID we receive is usually not coming directly from the bank; rather, it is assigned by our aggregation provider, so I don't think it would be helpful to you.

    Finally, do you think that following YNAB's import_id format is worth it for me? Wouldn't I benefit more from having my own import_id format, which could include a bank transaction's unique id, or a hash of the whole transaction (including the memo), and would be stable over time?

    You may use whatever import_id format you like, and it sounds like this may be best for your case. The only thing to consider is that if you are importing transactions through the API and also have your account "Linked" to your bank in YNAB and/or use file-based import, you may run into duplication issues because of the inconsistency of import_id. But if you are importing transactions through the API only, and it sounds like that is the case, you can use an import_id of your choosing. If you have enough data on your side to ensure a unique import_id for each transaction, that would be a good approach.
