# Is Rattle still relevant for R programming

## STATISTIK-FORUM.de

### Frequency distribution of article combinations

by stolph » Tue 28 Jun 2016, 5:37 pm

Hello everybody,

I have what I believe is a relatively trivial problem.
I have a data set (approx. 200,000 rows) that consists of a column with transaction numbers (order numbers) and a column with article numbers.
Because orders can contain several line items, the same order number naturally appears several times in the left column.

What do I want to find out?
Which items are most often bought in combination?

I don't want to set up any association rules, so no confidence, support, lift and so on.
I would also like to ignore the quantity of each item purchased per order.
I just want an overview along the lines of ...

    Article combination   Frequency
    a - b                 80
    a - b - c             45
    a - b - d             35
    ...

How can I achieve such an "article combination frequency distribution"?
This would certainly be possible with Excel, but with enormous effort - is there a simpler procedure here?

Best wishes,

Philip

### Re: Frequency distribution of article combinations

by structure puppet » Tue 28 Jun 2016, 10:27 pm

Hi,

stolph wrote: This would certainly be possible with Excel, but with enormous effort - is there a simpler procedure here?

- Excel is not a procedure.
- Something like this could certainly be programmed quickly in the old programming language COBOL (Common Business-Oriented Language).
- You could check whether there are still (free) compilers for COBOL.
- But C or Pascal would work just as well.

Regards,
S.

### Re: Frequency distribution of article combinations

by stolph » Wed 29 Jun 2016, 1:03 pm

Hello,

Thanks for the input.
I had been counting on R. I have already tried association analyses with R and Rattle.
The "problem" there was that, by the nature of the method, I have to define key figures (support, confidence) in order to derive association rules. But I don't want to define or discover any rules; I just want an overview of all combinations with their frequencies. If I enter 0 for "Support", RStudio hangs (because it finds too many rules?).

Do you have an alternative solution for me with R?

Best wishes,

Philip

### Re: Frequency distribution of article combinations

by bele » Wed 29 Jun 2016, 1:42 pm

How is anyone supposed to program an R solution when you haven't even told us what form the data is in? That is putting the cart before the horse. Please give a precise description of what the data looks like, preferably a working minimal example as explained here: http://forum.r-statistik.de/viewtopic.p ... 920 # p16347

And a precise idea of what exactly you want to know about the data. Is it really about a list of all combinations, or do you actually want to ask, for each item, which other item was bought with it most often, or what exactly is the goal? Why do you want to "initially" neglect the number of items purchased? So that someone programs something for you and then you come back with a changed request? No offense meant, but I speak from bad experience.

Oh yes, 200,000 rows are one thing. How many articles or article combinations should we expect? 10, 1,000, 50,000 items? Can one simply start programming, or does this turn out to be a combinatorial task for a mainframe?

Best regards,
Bernhard
----
'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'
'How do you know I'm mad?' said Alice.
'You must be,' said the Cat, 'or you wouldn't have come here.'
(Lewis Carroll, Alice in Wonderland)

### Re: Frequency distribution of article combinations

by stolph » Wed 29 Jun 2016, 8:57 pm

I absolutely agree with you, sorry, I should have expressed myself more clearly.
My data set is very similar to the attached sample. This sample contains 16,502 orders (unique identifier: order number) to which 8,857 unique articles (unique identifier: article number) are assigned.

Example: Article 6527670 is contained in 212 orders.
Question: Which articles appear most frequently in combination with 6527670 (within an order number)?
Regarding the "is it really about a list of all combinations" comment, a naive question: would that be so computationally intensive (because theoretically 10,502! combinations can occur)? Please don't take this the wrong way; I may just be slow on the uptake right now, or I am underestimating the effort or the variety of possible combinations ...

Regarding "initially" neglecting the number of articles: "initially" was superfluous. The quantity sold is not of interest, only the combination of articles, regardless of the quantities in which they appear in the orders.

So far I have tried the association analysis and the apriori algorithm:

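    library(arules)
    # Read (order number, article number) pairs from columns 1 and 2
    sample_data <- read.transactions(file = 'sample_data.csv', rm.duplicates = TRUE, format = 'single', sep = ',', cols = c(1, 2))
    # Mine rules from the transactions that were just read in
    rules <- apriori(sample_data, parameter = list(supp = 0.002, conf = 0.5))
    inspect(rules)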

I experimented a lot with supp and conf but never got a really satisfactory result, and the same goes for Rattle. In addition, I couldn't figure out how to focus the procedure on a specific article (target: 6527670).

Do you have a hint for me? Maybe association analysis is not really the right tool in this case.

Best wishes,

Philip

Unfortunately, I cannot upload a file attachment - does anyone have a tip?

### Re: Frequency distribution of article combinations

by bele » Wed 29 Jun 2016, 9:13 pm

Hello stolph,

stolph wrote: 8,857 unique articles (unique identifier: article number) [...]
or I am underestimating the effort or the variety of possible combinations ...

Let us assume that each article can appear alone, with one other, with two others or with three others. Then we are already at $8857 + 8857 \cdot 8856 + 8857 \cdot 8856 \cdot 8855 + 8857 \cdot 8856 \cdot 8855 \cdot 8854 \approx 6 \times 10^{15}$ possible combinations, and far more once 5-way combinations are included. Simply trying them all out is out of the question on your computer or mine. Had there been only a limited number of conceivable combinations, the evaluation algorithm might have needed a little less thought. However, the data has at most 200,000 rows, so at most that many combinations can actually occur. That is manageable.
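To check that order of magnitude: the product above counts ordered selections, so the number of unordered article sets is smaller, but still far too large to enumerate. A quick sketch in R, using only the article count from the post:

```r
n_articles <- 8857

# Ordered selections of 1 to 4 distinct articles, as in the sum above
ordered <- sum(sapply(1:4, function(k) prod(n_articles - seq_len(k) + 1)))

# Unordered sets of 1 to 4 distinct articles (order within a basket ignored)
unordered <- sum(choose(n_articles, 1:4))

format(c(ordered = ordered, unordered = unordered), digits = 3, scientific = TRUE)
```

Either way the count is in the hundreds of trillions or more, which supports the point that only the combinations actually present in the data can be tabulated.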

stolph wrote: Question: Which articles appear most frequently in combination with 6527670?

That may be a different question from the list of all possible combinations. I could imagine producing a matrix that counts, for each article, how often it occurs together with every other article. Are you sure you want a list of all 3-, 4- and 5-way combinations and not just the list of pair frequencies?
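Such a pair-frequency matrix can be sketched in base R. The toy data frame `orders` and its column names `order` and `article` are assumptions for illustration; the real file would be read in instead. The intermediate order-by-article table is dense, so with thousands of articles a sparse representation may be needed.

```r
# Toy data: one row per (order, article) pair; names are illustrative
orders <- data.frame(
  order   = c(1, 1, 2, 2, 2, 3, 3, 4),
  article = c("a", "b", "a", "b", "c", "a", "b", "d"),
  stringsAsFactors = FALSE
)

# Order-by-article incidence matrix: 1 if the order contains the article
incidence <- 1 * (unclass(table(orders$order, orders$article)) > 0)

# Article-by-article matrix: entry [i, j] counts orders containing both i and j;
# the diagonal holds the number of orders containing each article
pair_counts <- crossprod(incidence)

# Query a single article: partners of "a", most frequent first
partners_of_a <- sort(pair_counts["a", colnames(pair_counts) != "a"],
                      decreasing = TRUE)
partners_of_a
```

The same matrix serves both uses discussed here: reading off one row answers the question for a single article, while the upper triangle lists all pair frequencies.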

stolph wrote: In addition, I couldn't figure out how to focus the procedure on a specific article (target: 6527670).

What would be more useful for you: an endless list of combinations, or the ability to query a single article?

stolph wrote: Unfortunately, I cannot upload a file attachment - does anyone have a tip?

The best thing is to simulate data as described in the example linked above, or to invent some made-up data in the correct format, or to use an upload service like Dropbox. A message to the admin would not hurt either, because people fall into the trap of the unusable upload function again and again. Nothing has been done about it for a while.
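Simulating data in the described two-column format takes only a few lines. A sketch with made-up order and article numbers (all values are invented, and the column names `order` and `article` are assumptions):

```r
set.seed(42)  # make the simulation reproducible

n_rows <- 60
sim <- data.frame(
  order   = sample(1001:1020, n_rows, replace = TRUE),
  article = sample(6500001:6500010, n_rows, replace = TRUE)
)

# Drop duplicate (order, article) rows and sort by order number
sim <- unique(sim)
sim <- sim[order(sim$order, sim$article), ]
rownames(sim) <- NULL

head(sim)
```

Posting the output of `dput(sim)` would then let others reproduce the data without any file upload.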

Best regards,
Bernhard

### Re: Frequency distribution of article combinations

by stolph » Thu 30 Jun 2016, 3:26 pm

Hello bele,

Here is the link to the sample file:
https://www.dropbox.com/s/vv5tzian0zl28 ... a.csv?dl=0

bele wrote: Are you sure you want a list of all 3, 4, 5 combinations and not just the list of pair frequencies?

In the present case, the pair frequencies would definitely be sufficient. For special cases, though, I would still like to know what I would have to do to get combinations of 3 or 4 (absolute maximum).

bele wrote: What would be more useful for you, an endless list of combinations or the ability to query a single article?

Another difficult question. I currently have about 10 articles in focus, and I could work through them one after another using the second approach. For an "unknown" data set it would of course be helpful to get an overview of the articles that most frequently appear together, for example in pairs. An endless list is definitely not useful. Is it possible in this case to limit the variety of combinations, e.g. with a restriction along the lines of "the combination must occur at least 10 times; everything below that is not of interest"?
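Both wishes, combinations of a fixed size (2, 3 or 4) and a minimum frequency, can be sketched in base R by enumerating the k-item subsets inside each order with `combn()`. The helper name `count_subsets`, the toy data and the column names are assumptions for illustration. An order with m articles contributes `choose(m, k)` subsets, so very large orders make this expensive.

```r
# Toy data: one row per (order, article) pair
orders <- data.frame(
  order   = c(1, 1, 2, 2, 2, 3, 3, 3, 4),
  article = c("a", "b", "a", "b", "c", "a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Count how many orders contain each k-item subset (hypothetical helper)
count_subsets <- function(data, k, min_count = 1) {
  # One sorted, duplicate-free article vector per order
  baskets <- lapply(split(data$article, data$order),
                    function(x) sort(unique(x)))
  # Orders with fewer than k articles cannot contain a k-item subset
  baskets <- baskets[lengths(baskets) >= k]
  # All k-item subsets of every order, each collapsed to one string
  keys <- unlist(lapply(baskets, function(items) {
    combn(items, k, FUN = paste, collapse = "+")
  }), use.names = FALSE)
  counts <- sort(table(keys), decreasing = TRUE)
  counts[counts >= min_count]
}

pairs   <- count_subsets(orders, k = 2, min_count = 2)
triples <- count_subsets(orders, k = 3, min_count = 2)
pairs
triples
```

Unlike a table of complete orders, this counts a pair every time it occurs inside a larger order, so the result is closer to the "support count" of an itemset without any rules being derived.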

Best wishes,

Philip

### Re: Frequency distribution of article combinations

by bele » Fri 1 Jul 2016, 11:22 am

Hello stolph,

that is a whole series of wishes, and I cannot program all of them for a forum post. So I'll tackle the original problem: a list of all order combinations and their frequencies. Maybe you can solve the rest yourself with these hints and the intermediate results.

First we read in the data; for this I use choose.files() because I don't know where the data is on your disk. In line 2 I use tapply to group the rows by order and to sort the articles within each order. In line 3 I collapse each order into a single string in which the article numbers are joined in a well-defined order with a "+". In line 4 I use the table() function to count which combinations occur and how often. In line 5 I sort so that the most common combinations come first:
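The code from the original post is not preserved here. What follows is a reconstruction from the description above, not the author's exact code. A small toy data frame replaces the `choose.files()` call so that the sketch runs on its own, and the column names `order` and `article` are assumptions.

```r
# Step 1: data (the post reads a CSV chosen via choose.files(); toy data here)
orders <- data.frame(
  order   = c(1, 1, 2, 2, 2, 3, 3, 4),
  article = c(6790904, 6527670, 6527670, 6790904, 1234567,
              6527670, 6790904, 7654321)
)

# Step 2: collect the articles of each order, sorted and without duplicates
per_order <- tapply(orders$article, orders$order,
                    function(x) sort(unique(x)))

# Step 3: one string per order, article numbers joined with "+"
keys <- sapply(per_order, paste, collapse = "+")

# Step 4: count how often each combination occurs
combos <- table(keys)

# Step 5: most frequent combinations first
combos <- sort(combos, decreasing = TRUE)
combos
```

Sorting the articles within each order before pasting is what makes "A+B" and "B+A" end up as the same string.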

You can output the 30 most frequent combinations with their respective frequencies, for example:
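The snippet for this step is missing here. Assuming `combos` is a frequency table sorted in decreasing order, `head()` would be one way to do it; a toy table stands in for the real one:

```r
# Toy stand-in for the sorted combination table
keys <- c("a+b", "a+b", "a+b", "a+b+c", "a+b+c", "d")
combos <- sort(table(keys), decreasing = TRUE)

top <- head(combos, 30)  # returns fewer entries if the table is shorter
top
```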

How many order combinations are there?
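The original snippet is missing here as well. With a combination table named `combos` (an assumed name, built from toy data below), the number of distinct combinations is simply its length, and the counts should add up to the number of orders:

```r
# Toy stand-in for the combination table
keys <- c("a+b", "a+b", "a+b", "a+b+c", "a+b+c", "d")
combos <- table(keys)

n_distinct <- length(combos)  # number of distinct combinations
n_orders   <- sum(combos)     # each order falls into exactly one combination
c(distinct = n_distinct, orders = n_orders)
```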

A list of combinations that occur at least 10 times:

There are 149 of these in the example data set.
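The filtering code itself is not preserved. A minimal sketch of such a threshold, assuming a frequency table named `combos` (toy data here, so the count differs from the 149 reported for the sample file):

```r
# Toy stand-in for the sorted combination table
keys <- c(rep("a+b", 12), rep("a+b+c", 10), rep("d", 3))
combos <- sort(table(keys), decreasing = TRUE)

# Keep only combinations that occur at least 10 times
frequent <- combos[combos >= 10]
frequent
length(frequent)
```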

Have fun with it,
Bernhard

### Re: Frequency distribution of article combinations

by stolph » Thu 7 Jul 2016, 9:19 am

Hello bele,

many, many thanks for your effort. That really helps me.
I've tried different things using your code and experimented with different data.
At the moment I am still stuck on a fundamental question of understanding:

bele wrote: You can output the 30 most frequent combinations with their respective frequencies [...]

The most common combination in this case is "6527670 + 6790904".
If the combination "6527670 + 6790904 + XYZ" also occurs in the frequency distribution, is its frequency then included in that of "6527670 + 6790904"?

Best wishes,

Philip

### Re: Frequency distribution of article combinations

by bele » Thu 7 Jul 2016, 12:21 pm

Hello Philip,

I thought you had gone. No, in this case each combination is counted separately. With this question in mind, though, you may now better understand why I insisted earlier on clarifying the requirements as precisely as possible.

If you are looking for all combinations in which both "6527670" and "6790904" occur, you can find them using subset() and R's string functions. At this point you will have to get to grips a little with regular expressions, and it will probably be somewhat easier if you look at the package "stringr" and its function str_detect(): https://cran.r-project.org/web/packages ... ringr.html
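One way to do this without regular expressions is to split each combination string on the "+" separator and test for exact membership, which avoids accidental substring matches (for example "6527670" inside "16527670"). A sketch in base R, swapped in for the stringr approach mentioned above; the named vector `combos` is a made-up stand-in for the real frequency table and `contains_all` is a hypothetical helper name:

```r
# Made-up combination counts, keyed by "+"-joined article numbers
combos <- c(
  "6527670+6790904"         = 5,
  "1111111+6527670+6790904" = 2,
  "6527670+9999999"         = 3,
  "16527670+6790904"        = 1
)

# TRUE for every key that contains all wanted article numbers as whole items
contains_all <- function(keys, wanted) {
  parts <- strsplit(keys, "+", fixed = TRUE)
  vapply(parts, function(p) all(wanted %in% p), logical(1))
}

hits <- combos[contains_all(names(combos), c("6527670", "6790904"))]
hits
sum(hits)  # orders containing both articles, whatever else was in them
```

Summing the matching rows gives the total that the per-combination table does not show directly: how often the two articles were bought together regardless of any further items in the order.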

Best regards,
Bernhard